Hacker Pig Latin: A Base64 Primer for Security Analysts


The Base64 encoding scheme is often used to hide the plaintext elements in the early stages of an attack that can't be concealed under the veil of encryption. Here's how to see through its tricks.

Daniel Smallwood, Senior Threat Research Engineer, IronNet

January 21, 2021


(image by Daniel Berkman, via Adobe Stock)


If you have young kids, you'll relate to the value of being able to speak in code. For a few years, I was able to use Pig Latin to speak covertly around my kids. It was handy, and surprisingly effective, until they decoded the scheme and began speaking Pig Latin in front of me. Artypay overray.

I think about this every time I witness attacks where pieces are encoded (not encrypted) to hide what's going on. One of these encodings is typically Base64. Why is this so common? Most machines speak Base64, but most security analysts don't.

In this post, I'll explain what the Base64 encoding scheme is, then discuss how it's used both for good and evil intent. Next, I'll look at some common detection applications of Base64 and where they sometimes fall short, giving advice on what you can do to strengthen them. Finally, I'll address some other encoding algorithms in the wild to help round out the topic and perhaps give you the ability to dead reckon when the bad guys may be trying to hide something.

What Is Base64?
Base64 is an encoding scheme that can take any binary input and represent it using a set of 64 ASCII characters. It's important to note that Base64 is not encryption; it's an encoding scheme, so decoding it is trivial. Simple, free Base64 encode/decode tools are easy to find online.

Encoding in Base64 is an inflationary operation: the 11-character input string "Hello World" converts to 16 characters in Base64.

"Hello World" --( Base64Encode )--> "SGVsbG8gV29ybGQ="

Figure 4: Base64 table. Credit: Wikimedia Commons
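The inflation is easy to confirm with Python's standard base64 module. This is only a minimal sketch, and the variable names are illustrative.

```python
import base64

plaintext = b"Hello World"
encoded = base64.b64encode(plaintext)

# Every 3 input bytes become 4 output characters, rounded up to a full group.
predicted_length = 4 * ((len(plaintext) + 2) // 3)

print(encoded.decode("ascii"))                          # SGVsbG8gV29ybGQ=
print(len(plaintext), len(encoded), predicted_length)   # 11 16 16
```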

Let's look at a few more Base64 strings.

"Secret string" => U2VjcmV0IHN0cmluZw==

"Be sure to drink your ovaltine" => QmUgc3VyZSB0byBkcmluayB5b3VyIG92YWx0aW5l

They all contain characters from the set [A-Za-z0-9/+] and can end with 0-2 equal signs. Why these characters? The purpose of Base64 is to encode anything (namely binary data) into the characters that are carried easily by text-only protocols.
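As a rough illustration, the character set and padding rule can be turned into a quick shape check. The helper name looks_like_base64 is hypothetical, and passing the check only means a string could be Base64, not that it is.

```python
import base64
import re

# Standard alphabet, followed by zero to two "=" padding characters.
BASE64_SHAPE = re.compile(r"[A-Za-z0-9+/]+={0,2}")

def looks_like_base64(text: str) -> bool:
    """Return True if text has the shape of a complete, padded Base64 string."""
    return len(text) % 4 == 0 and BASE64_SHAPE.fullmatch(text) is not None

samples = {
    "Secret string": "U2VjcmV0IHN0cmluZw==",
    "Be sure to drink your ovaltine": "QmUgc3VyZSB0byBkcmluayB5b3VyIG92YWx0aW5l",
}

for plaintext, expected in samples.items():
    encoded = base64.b64encode(plaintext.encode("ascii")).decode("ascii")
    print(encoded == expected, looks_like_base64(encoded))
```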

For example, email was originally designed to carry only text data. As email evolved, the protocols that delivered it didn't. Attaching binary documents like pictures and media files was not possible. The path of least resistance to allow email to progress was to create a binary-to-text encoding scheme rather than altering the protocol.

One facet of SMTP that makes this clear is the end-of-message indicator. In SMTP, an email client signals the end of a message by sending a single line that contains only a period. (SMTP implementation details, although long, are surprisingly easy to read: https://tools.ietf.org/html/rfc2821.) Isn't this period trick odd? What if an email author wanted to send a single line with a period in their email? Send arbitrary (non-text) data as part of the message body and you could interfere with this protocol feature, and likely others, too.
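To make the collision concrete, here is a small sketch with a made-up binary payload that happens to contain the end-of-message sequence. Once Base64 encoded, the sequence cannot appear, because the alphabet has no period, carriage return, or line feed.

```python
import base64

# CRLF, a lone period, CRLF: the SMTP end-of-message marker described above.
END_OF_DATA = b"\r\n.\r\n"

# Hypothetical attachment bytes that happen to contain the marker.
payload = b"\x89PNG" + END_OF_DATA + b"\x00\xff more binary data"
encoded = base64.b64encode(payload)

print(END_OF_DATA in payload)   # True
print(END_OF_DATA in encoded)   # False
print(b"." in encoded)          # False
```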

Another common legitimate example of Base64 use is embedding raw binary data (e.g., images) in-line with HTML pages. HTML is a text-only format after all, and if you want to carry an image right in the page, versus by hyperlink for the browser to grab on a separate connection, Base64 is your answer.
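In-line images are usually carried in a "data:" URI. The sketch below builds one; the helper name to_data_uri is hypothetical, and the bytes are just a GIF signature used as a placeholder, not a complete image.

```python
import base64

def to_data_uri(data: bytes, mime_type: str) -> str:
    """Wrap raw bytes in a Base64 data URI suitable for an HTML attribute."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"

# Placeholder bytes: only the 6-byte GIF signature, not a renderable image.
gif_signature = b"GIF89a"

uri = to_data_uri(gif_signature, "image/gif")
print(f'<img src="{uri}" alt="inline image">')
```

Analysts who see "R0lGODlh" at the start of a Base64 blob are looking at exactly this GIF signature.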

Why Might Attackers Use Base64?
Base64 is often used to hide the plaintext elements of an attack that can't be concealed under the veil of encryption. Look for Base64 use in early stages of attacks, when the breach is narrow.

Using real encryption is hard during early attack stages because encryption requires tooling and key exchange. The adversary can't guarantee that the required cryptographic tools will be available and accessible on the victim host to decrypt anything. But Base64 tooling is far more ubiquitous.

Even if we presume tooling isn't an issue, carrying a symmetric key with the encrypted payload defeats the purpose, and asymmetric keys aren't a solution either, as they require both infrastructure and further exposure. When an adversary uses encryption, it usually occurs later in the attack and piggybacks over third-party infrastructure.

Examples:
Let's walk through a simple encoding exercise. The two letters "IN" encode to "SU4=" in Base64, like so:

Figure 5:
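The same walk-through can be reproduced by hand in a few lines of Python. This is a teaching sketch (the function name encode_by_hand is made up); in practice you would simply call base64.b64encode.

```python
import base64

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def encode_by_hand(data: bytes) -> str:
    """Encode bytes to Base64 by regrouping 8-bit bytes into 6-bit values."""
    # 1. Write every byte as 8 bits: "IN" -> 01001001 01001110
    bits = "".join(f"{byte:08b}" for byte in data)
    # 2. Zero-fill on the right so the bits split evenly into 6-bit groups.
    bits += "0" * (-len(bits) % 6)
    # 3. Look each 6-bit group up in the 64-character table.
    chars = [ALPHABET[int(bits[i:i + 6], 2)] for i in range(0, len(bits), 6)]
    encoded = "".join(chars)
    # 4. Pad with "=" so the output length is a multiple of 4.
    return encoded + "=" * (-len(encoded) % 4)

print(encode_by_hand(b"IN"))   # SU4=
```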

Important!
There's one BIG takeaway to absorb as you look at the tables and example: There is no direct translation between an ASCII character and its Base64 equivalent. In other words, the character "A" translates to three different representations in Base64 depending on its offset within a three-byte group (position 0, 1, or 2). It's this misunderstanding that lies at the root of the problem with many Base64 security detections.

These encoded strings illustrate this:

"Secret string" =Base64=> U2VjcmV0IHN0cmluZw==

"ASecret string" =Base64=> QVNlY3JldCBzdHJpbmc=

"AASecret string" =Base64=> QUFTZWNyZXQgc3RyaW5n

--- As we prepend more characters now, things begin repeating ---

"AAASecret string" =Base64=> QUFBU2VjcmV0IHN0cmluZw==

...

A detection that looks for the Base64 version of "Secret string" must consider that it has three representations.
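The three representations can be generated rather than worked out by hand. In this sketch the function name offset_encodings and the choice of "A" as the prepended filler are illustrative assumptions.

```python
import base64

def offset_encodings(needle: bytes, count: int = 3) -> list[str]:
    """Encode needle with 0, 1, 2, ... filler bytes prepended."""
    return [
        base64.b64encode(b"A" * offset + needle).decode("ascii")
        for offset in range(count)
    ]

variants = offset_encodings(b"Secret string", count=4)
for offset, variant in enumerate(variants):
    print(offset, variant)

# With three bytes prepended the alignment is back where it started, so the
# offset-0 encoding reappears intact inside the offset-3 encoding.
print(variants[0] in variants[3])   # True
```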

Common Base64 Analysis Techniques and Oversights
When analyzing a string believed to be nefarious plaintext data hidden with Base64, it's important to remember that the suspect string may be only a fragment. It might be necessary to add padding to the beginning and adjust padding at the end to get the decoded text out. Let's look at an example using the CyberChef tool.


Our suspect string is:

ldCBodHRwOi8vMTAuMS4yLjMvdG9vbGtpdHMvbm90aGluZ190b19zZWVfaGVyZS5iaW4=

Step 1: Adjust Trailing Padding if Necessary
We put the suspect string into CyberChef and choose the "From Base64" recipe, which produces the error: "Data is not a valid byteArray." Adjust the number of trailing "=" characters (anywhere from zero to two) until the error goes away. In this example, deleting the "=" allows for decoding.

Figure 6: CyberChef

Step 2: If Plaintext Isn't Apparent, Prepend Some Characters
If the output looks to be binary and you suspect text, don't give up yet. Add some characters to the beginning to see if it's simply a bit alignment problem due to truncated data. You can use any valid Base64 character here, but consider using "/", as the injected padding tends to stand out better in the decoded output (unless the first encoded character is already a "/"). With our test string, prepending three characters reveals the plaintext.

Figure 7:
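The same two steps can be scripted outside CyberChef. The sketch below is one possible approach, not a reproduction of the tool's behavior: the helper names and the printable-ASCII score are assumptions made for illustration.

```python
import base64

fragment = "ldCBodHRwOi8vMTAuMS4yLjMvdG9vbGtpdHMvbm90aGluZ190b19zZWVfaGVyZS5iaW4="

def decode_with_prefix(fragment: str, prepend: int, filler: str = "/") -> bytes:
    """Decode a Base64 fragment after prepending `prepend` filler characters."""
    body = filler * prepend + fragment.rstrip("=")
    if len(body) % 4 == 1:
        # A lone trailing character cannot carry a full byte, so drop it.
        body = body[:-1]
    # Re-pad so the length is a multiple of 4 (Step 1 above).
    body += "=" * (-len(body) % 4)
    return base64.b64decode(body)

def printable_ratio(data: bytes) -> float:
    """Fraction of bytes that are printable ASCII (0x20 to 0x7E)."""
    if not data:
        return 0.0
    return sum(0x20 <= byte <= 0x7E for byte in data) / len(data)

# Step 2 above: try each alignment and look for the one that reads as text.
for prepend in range(4):
    decoded = decode_with_prefix(fragment, prepend)
    print(prepend, f"{printable_ratio(decoded):.2f}", decoded)
```

With three "/" characters prepended, the first three decoded bytes are filler noise and the rest is the recovered text.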

Where Will I See Base64?
A security analyst will encounter Base64 encoded strings in a variety of places.

The routine and most common places come from examining mail attachments and embedded content (mostly images) from web pages. Other places should cause analysts to be on alert -- for instance, when Base64 strings are detected on the command line.

Below is an example of a reverse shell hiding in plain sight using a PowerShell command. (Ref: mkpsrevshell.py, https://gist.github.com/tothi/ab288fb523a4b32b51a53e542d40fe58.) This leverages the "-e / -EncodedCommand" feature of PowerShell that allows a Base64 string to be passed in. PowerShell will decode the Base64, then execute the script inside.

Figure 8: Ref mkpsrevshell.py https://gist.github.com/tothi/ab288fb523a4b32b51a53e542d40fe58
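One detail worth knowing when decoding these by hand: PowerShell expects the Base64 to wrap UTF-16LE text, so a plain decode yields characters interleaved with null bytes. The sketch below uses a harmless command as a stand-in for a real payload, and the helper names are hypothetical.

```python
import base64

def encode_powershell_command(script: str) -> str:
    """Produce the kind of string passed to -EncodedCommand."""
    return base64.b64encode(script.encode("utf-16-le")).decode("ascii")

def decode_powershell_command(encoded: str) -> str:
    """Recover the script text from an -EncodedCommand argument."""
    return base64.b64decode(encoded).decode("utf-16-le")

# Harmless stand-in for a real payload.
encoded = encode_powershell_command("Write-Output 'hello'")
print(encoded)
print(decode_powershell_command(encoded))
```

Because every other byte of ASCII text in UTF-16LE is zero, these strings tend to be littered with the letter "A", which is a handy visual tell.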

Spawning a process with Base64 on its command line is suspicious in itself. If you're monitoring Windows process creation, you should investigate whenever you see that happening.
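A crude way to surface such command lines is to look for long runs of Base64 characters that also decode cleanly. This is a heuristic sketch only: the 24-character threshold and the function name are arbitrary assumptions, and long slash-separated paths can trigger false positives.

```python
import base64
import re

# A long run of Base64 alphabet characters with optional "=" padding.
CANDIDATE = re.compile(r"[A-Za-z0-9+/]{24,}={0,2}")

def suspicious_tokens(command_line: str) -> list[str]:
    """Return substrings of a command line that decode as padded Base64."""
    hits = []
    for match in CANDIDATE.finditer(command_line):
        token = match.group(0)
        if len(token) % 4 != 0:
            continue
        try:
            base64.b64decode(token, validate=True)
        except ValueError:
            continue
        hits.append(token)
    return hits

print(suspicious_tokens("ping -n 1 example.com"))   # []
```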

Let's look at another common oversight spotted in a Sigma IDS rule. The rule fragment below is published to Sigma and looks for a particular Base64 string (among other things, see full rule for that):

Figure 9:

This rule fires when the string "L3NlcnZlc" is observed. According to the rule, this string translates to "/server=". In fact, it falls a bit short. If we use CyberChef, we see that it actually decodes to "/servet", a bug probably introduced by the input string carrying a trailing "=" sign. Now that we are savvy Base64 sleuths, we can update this rule to the correct string: "L3NlcnZlcj0=". And, using our knowledge of the bit offset problem, we can add the two other Base64 variants that detect the same thing: "y9zZXJ2ZXI9" and "c2VydmVyPQ". One caution: the first character or two of an offset variant, and the last character whenever more data follows "/server=", are shaped by the neighboring bytes, so the substrings guaranteed to appear at each offset are "L3NlcnZlcj", "9zZXJ2ZXI9", and "vc2VydmVyP".
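When writing a signature for a substring that may sit anywhere inside a larger encoded blob, it helps to compute which Base64 characters are fully determined by the substring itself. The sketch below is one way to do that; the function name stable_fragments is made up, and the trimming rule assumes unknown data may both precede and follow the needle.

```python
import base64

def stable_fragments(needle: bytes) -> list[str]:
    """Base64 substrings that appear whenever needle occurs at offset 0, 1, or 2."""
    fragments = []
    for offset in range(3):
        encoded = base64.b64encode(b"\x00" * offset + needle).decode("ascii")
        encoded = encoded.rstrip("=")
        # Leading characters that mix in bits from the unknown preceding bytes.
        lead = (0, 2, 3)[offset]
        # The final character mixes in bits from whatever follows, unless the
        # needle ends exactly on a 3-byte boundary.
        tail = 0 if (offset + len(needle)) % 3 == 0 else 1
        fragments.append(encoded[lead:len(encoded) - tail])
    return fragments

print(stable_fragments(b"/server="))
# ['L3NlcnZlcj', '9zZXJ2ZXI9', 'vc2VydmVyP']
```

Matching on any of these three fragments covers "/server=" regardless of what surrounds it in the original data.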

Another common Base64 exposure for security analysts is HTTP Basic Authentication. (Maybe this isn't as "common" as it used to be, but I'm pretty sure every security analyst has seen at least one of these alerts fire.) Here's an example of an HTTP header using it. HTTP Basic auth carries a Base64-encoded "username:password" pair in the client's "Authorization" header, and this example decodes to "joeuser:very$ecure". The problem is now pretty obvious: this is effectively a plaintext password.

Figure 10:
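Recovering the credentials takes only a few lines. In this sketch the header value is rebuilt from the example credentials above rather than copied from the figure, and the helper name decode_basic_auth is hypothetical.

```python
import base64

def decode_basic_auth(header_value: str) -> tuple[str, str]:
    """Split an 'Authorization: Basic ...' header value into (username, password)."""
    scheme, _, token = header_value.partition(" ")
    if scheme.lower() != "basic":
        raise ValueError("not a Basic Authorization header")
    decoded = base64.b64decode(token.strip()).decode("utf-8")
    # Only the first colon separates the fields; passwords may contain colons.
    username, _, password = decoded.partition(":")
    return username, password

header_value = "Basic " + base64.b64encode(b"joeuser:very$ecure").decode("ascii")
print(decode_basic_auth(header_value))   # ('joeuser', 'very$ecure')
```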

Other Encoding Schemes
If you're a security analyst, at this point you may have realized a great evil application for Base64: data exfiltration over DNS! But there are a couple of problems here. First, the defined character set for Base64 includes characters not allowed in DNS names (+, /, =). Second, DNS is case-insensitive. An adversary couldn't guarantee that their Base64-encoded subdomain wouldn't get "lowered" along the way. But … there's always Base32! Base32 is very similar to Base64, except it doesn't rely on upper/lowercase distinctions to carry information. Base32 is even more inflationary than Base64, so encoding large amounts of data for exfiltration using Base32 is sure to be a very loud network event.
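Both points, case tolerance and extra inflation, show up in a short comparison. This is an illustrative sketch using the earlier sample string; stripping and restoring the "=" padding is an assumption about how such a label would be made DNS-friendly.

```python
import base64

secret = b"Secret string"

b64 = base64.b64encode(secret).decode("ascii")
b32 = base64.b32encode(secret).decode("ascii")
print(len(secret), len(b64), len(b32))   # 13 20 24

# Base32 survives being lowercased, and its "=" padding can be restored.
label = b32.rstrip("=").lower()
restored = label.upper() + "=" * (-len(label) % 8)
print(base64.b32decode(restored) == secret)      # True

# Base64 does not survive the same treatment.
print(base64.b64decode(b64.lower()) == secret)   # False
```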

Don't forget, too, that Base16 (hex) and Base2 (binary) are also valid encoding schemes with tooling readily available early in an attack. Security analysts see these everywhere in their daily work, but rarely as part of an adversary technique that needs analysis the way Base64 does.

Variants of Base64 use different alphabets. For instance, the "URL and filename safe" variant defined in RFC 4648 substitutes "-" for "+" and "_" for "/". So just because you see something that looks like a Base64 string but has a "-" or "_" in it, don't discount it too quickly. The CyberChef tool demonstrated earlier can be configured for these alternate alphabets.
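Python's base64 module supports the URL and filename safe alphabet directly. The sketch below also shows a small normalizing decoder; the name decode_any_alphabet is hypothetical, and it handles only these two alphabets.

```python
import base64

# Bytes chosen so every output character is one of the two that differ.
data = bytes([0xFB, 0xFF, 0xFE])
print(base64.b64encode(data).decode("ascii"))           # +//+
print(base64.urlsafe_b64encode(data).decode("ascii"))   # -__-

def decode_any_alphabet(text: str) -> bytes:
    """Decode standard or URL-safe Base64, with or without "=" padding."""
    normalized = text.translate(str.maketrans("-_", "+/")).rstrip("=")
    normalized += "=" * (-len(normalized) % 4)
    return base64.b64decode(normalized)

print(decode_any_alphabet("-__-") == decode_any_alphabet("+//+"))   # True
```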

Summary
We explored Base64 encoding from the security analyst's perspective. Base64 encoding is traditionally used to convert binary data to printable text characters, but it can also be used to hide plaintext. Security analysts should keep these common techniques in mind while performing investigations, as all too often encoding plaintext as Base64 is enough to slip past the best detection engine there is (our eyes).

Once understood, Base64 detection flaws can be identified and signatures/logic improved to reflect all possible permutations.

About the Author(s)

Daniel Smallwood

Senior Threat Research Engineer, IronNet

Daniel Smallwood is a Senior Threat Research Engineer at IronNet, the company bridging the gap between traditional cybersecurity approaches and the modern threat. Prior to IronNet, Daniel spent more than 18 years in security and software development for companies including Alert Logic, Click Security, Jask, SumoLogic and others.
