The Trojan Source attack method, identified by University of Cambridge researchers, uses hidden Unicode characters to make source code look different to a human reviewer than it does to the compiler, so the resulting binaries can contain extra instructions and backdoors that a developer or security analyst doesn't know about. Because the special characters are not visible by default, the malicious code is unlikely to be discovered during code review.
Attacks based on how Unicode displays text are not new, but Trojan Source may feel like a bigger deal because of the sheer amount of code that gets copied and pasted from public sites, such as Stack Overflow, GitHub, and other centralized forums, into individual source code files. If problematic Unicode characters are hidden in that code, they get copied in as well.
“This scenario demonstrates the proactive power of source code reviews, and it would be a good best practice not to copy and paste code for the time being,” says Jon Gaines, senior application consultant at nVisium. “It's always better to rewrite it yourself.”
Make Unicode Visible
Developers can detect potentially malicious Unicode characters by configuring the IDE or text editor they work with to display otherwise invisible Unicode characters. Or they can use a hex editor, such as the browser-based HexEd.It, and search for specific Unicode characters in the file, Gaines says.
Major source control platforms have already responded: GitHub, GitLab, and Atlassian (for Bitbucket) now post alerts for the Unicode BiDi characters (CVE-2021-42574).
Because the Visual Studio Code text editor can be tripped up by this attack, one workaround is to change the file encoding to a non-Unicode encoding. That will show the malicious BiDi characters as mangled characters, says Shachar Menashe, senior director of security research at JFrog. The mangled characters should get caught during a manual code review.
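Where an editor cannot be configured to show these characters, a few lines of Python can do the same job. The following is a minimal sketch, not a tool mentioned by Gaines: the `reveal` function name is hypothetical, and the character list is limited to the nine BiDi controls cited for CVE-2021-42574 later in this article.

```python
import unicodedata

# The nine BiDi control characters associated with CVE-2021-42574.
BIDI_CONTROLS = "\u202A\u202B\u202C\u202D\u202E\u2066\u2067\u2068\u2069"


def reveal(line: str) -> str:
    """Return `line` with every BiDi control replaced by a visible tag."""
    out = []
    for ch in line:
        if ch in BIDI_CONTROLS:
            # For example, U+202E becomes "<U+202E RIGHT-TO-LEFT OVERRIDE>".
            out.append(f"<U+{ord(ch):04X} {unicodedata.name(ch)}>")
        else:
            out.append(ch)
    return "".join(out)
```

Printing `reveal(line)` for each line of a suspect file turns the invisible controls into plain ASCII tags that stand out in a terminal or a review comment.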
This is what Unicode BiDi would look like once the change is made in Visual Studio Code:
There are homoglyphs that are difficult to differentiate from legitimate characters. This is how those characters would appear once the change is made in Visual Studio Code:
Menashe says Visual Studio, Notepad++, and Sublime Text actually aren't affected by BiDi characters in a vulnerable way, as the line is either mangled or the entire line shows up as a comment:
Filter Out the Characters
The Trojan Source methods will have “minimal security impact in the real world” because regular source code typically does not contain the special Unicode characters outlined by the researchers (BiDi and homoglyphs), says Menashe. They are “easy to detect, alert on, and perhaps even filter out automatically,” he says.
The following Linux commands can either alert on or strip out all non-ASCII characters from an individual source code file:
- Alert: iconv -f utf-8 -t ascii input.cpp
- Strip: iconv -c -f utf-8 -t ascii input.cpp -o filtered_output.cpp
Alternatively, this Linux command will check a list of files and flag instances where the special characters are found (filelist and RTLcharacters are placeholders for the files to scan and the pattern to match):
- for file in filelist; do hexdump -C "$file" | grep RTLcharacters; done
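The same alert-only check can be sketched in Python, which avoids having to match raw hex output. This is an illustrative sketch rather than a vetted scanner: `find_bidi` and `scan_files` are hypothetical names, files are assumed to be UTF-8, and only the nine BiDi controls tied to CVE-2021-42574 are checked.

```python
from pathlib import Path

# The nine BiDi control characters associated with CVE-2021-42574.
BIDI_CONTROLS = frozenset("\u202A\u202B\u202C\u202D\u202E\u2066\u2067\u2068\u2069")


def find_bidi(text):
    """Yield (line_number, column, code_point) for each BiDi control in text."""
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                yield line_no, col, f"U+{ord(ch):04X}"


def scan_files(paths):
    """Print one alert per hit and return the total number of hits."""
    hits = 0
    for path in paths:
        # errors="replace" keeps the scan going on files that are not valid UTF-8.
        text = Path(path).read_text(encoding="utf-8", errors="replace")
        for line_no, col, code in find_bidi(text):
            print(f"{path}:{line_no}:{col}: {code}")
            hits += 1
    return hits
```

Calling `scan_files(sys.argv[1:])` from a small wrapper script would report each hit in the familiar `file:line:column` form.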
Instead of just alerting, the following commands can strip out only the specific characters targeted in Trojan Source from the individual code file.
The following two Linux commands strip out Unicode BiDi characters (CVE-2021-42574):
- CHARS=$(python3 -c 'import sys; sys.stdout.buffer.write("\u202A\u202B\u202D\u202E\u2066\u2067\u2068\u202C\u2069".encode("utf8"))')
- sed 's/['"$CHARS"']//g' < input.cpp > filtered_output.cpp
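For teams that prefer to keep the filter in one script, the two commands above can be approximated in Python. This is a sketch under stated assumptions: `strip_bidi` and `strip_file` are hypothetical names, and the input is assumed to be UTF-8.

```python
# The nine BiDi control characters associated with CVE-2021-42574.
BIDI_CONTROLS = "\u202A\u202B\u202C\u202D\u202E\u2066\u2067\u2068\u2069"

# str.translate deletes every code point that maps to None.
_DELETE_TABLE = {ord(ch): None for ch in BIDI_CONTROLS}


def strip_bidi(text: str) -> str:
    """Return `text` with all listed BiDi controls removed."""
    return text.translate(_DELETE_TABLE)


def strip_file(src: str, dst: str) -> None:
    """Write a filtered copy of `src` to `dst`, leaving `src` untouched."""
    # newline="" preserves the file's original line endings.
    with open(src, encoding="utf-8", newline="") as f:
        text = f.read()
    with open(dst, "w", encoding="utf-8", newline="") as f:
        f.write(strip_bidi(text))
```

Unlike the iconv approach, this leaves other non-ASCII text, such as accented characters in comments or string literals, in place.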
For Unicode homoglyph characters (CVE-2021-42694), these two commands use a partial list that strips Cyrillic homoglyphs only:
- CHARS=$(python3 -c 'import sys; sys.stdout.buffer.write("\u0405\u0406\u0408\u0410\u0412\u0415\u0417\u041D\u0420\u0421\u0422\u0425\u0430\u0440\u0441\u0443\u0445\u0455\u04AE\u04BB\u04C0".encode("utf8"))')
- sed 's/['"$CHARS"']//g' < /tmp/utf8_input.txt > /tmp/ascii_output.txt
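Stripping homoglyphs changes identifiers and would also remove these letters from legitimate Cyrillic text, so some teams may prefer to alert instead. The sketch below is one possible heuristic, not something proposed by Menashe: it flags only words that mix ASCII letters with a character from the same partial Cyrillic list. The `suspicious_words` name is hypothetical.

```python
import re

# Same partial list of Cyrillic homoglyphs used in the commands above.
HOMOGLYPHS = frozenset(
    "\u0405\u0406\u0408\u0410\u0412\u0415\u0417\u041D\u0420\u0421\u0422"
    "\u0425\u0430\u0440\u0441\u0443\u0445\u0455\u04AE\u04BB\u04C0"
)
WORD = re.compile(r"\w+")


def suspicious_words(text):
    """Return words that mix ASCII letters with a listed homoglyph.

    Each word is returned with non-ASCII characters escaped so the
    lookalike is visible in the report.
    """
    found = []
    for match in WORD.finditer(text):
        word = match.group()
        has_ascii = any(ch.isascii() and ch.isalpha() for ch in word)
        has_homoglyph = any(ch in HOMOGLYPHS for ch in word)
        if has_ascii and has_homoglyph:
            found.append(word.encode("unicode_escape").decode("ascii"))
    return found
```

A word written entirely in Cyrillic is not flagged, which keeps false positives down in codebases with localized comments or strings.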
Update the Tools
Install compiler updates that block the attack method as they become available. Until those updates are applied, the commands that automatically detect and sanitize files will mitigate the issue. While it is possible to perform a manual source code audit to look for these special characters after changing the text-editor settings, that would be the “worst way to handle this issue,” Menashe says, since some of the characters can be indistinguishable from legitimate Latin characters.
“The best solution is to run automated tools that alert and/or strip these characters,” he says.
Individual audits of files won't scale very well in organizations with large codebases. Red Hat has released a simple Python script to identify potential issues across an entire codebase. The script can be integrated into continuous integration/continuous delivery workflows or added as a pre-commit check to ensure malicious code does not enter production.
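As an illustration of what such a check might look like, the sketch below returns a nonzero status when any file contains a BiDi control. It is not Red Hat's script: the function names are hypothetical, files are assumed to be UTF-8, and detection relies on the bidirectional class reported by Python's `unicodedata` module rather than a hard-coded character list.

```python
import sys
import unicodedata

# Bidirectional classes of the explicit embedding, override, and isolate controls.
BIDI_CLASSES = {"LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI"}


def has_bidi_control(text: str) -> bool:
    """Return True if any character in `text` is a BiDi control."""
    return any(unicodedata.bidirectional(ch) in BIDI_CLASSES for ch in text)


def main(paths) -> int:
    """Return 1 if any file contains a BiDi control, else 0."""
    bad = []
    for path in paths:
        with open(path, encoding="utf-8", errors="replace") as f:
            if has_bidi_control(f.read()):
                bad.append(path)
    for path in bad:
        print(f"BiDi control character found: {path}", file=sys.stderr)
    return 1 if bad else 0
```

A pre-commit hook or CI step could pass the changed file names to `main` and hand its return value to `sys.exit`, so the commit or build fails whenever a flagged file is present.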
Bob Rudis, chief security data scientist at Rapid7, also recommends a straightforward mitigation: “Disallow BiDi directives in your code base if you're writing in only English or only Arabic,” he wrote in a blog post.
And despite the Common Vulnerability Scoring System rating of 9.8, there is no reason to go into firefighting mode. The 9.8 score is “overblown,” Rudis said. To exploit this weakness, the adversary would need to have direct access to developers’ workstations, source code management system, or continuous integration pipelines.
“If an attacker has direct access to your source code management system, frankly, you probably have bigger problems than this attack,” Rudis stated. “We advise prioritizing truly critical patches and limiting service and system exposure before worrying about source code-level attacks that require local or physical access.”