A 15-year-old flaw in the Python open source programming language has remained unpatched in many places, making its way into hundreds of thousands of both open source and closed source projects worldwide. This is inadvertently creating a broadly vulnerable software supply chain that most affected organizations are unaware of, researchers warned.
That's according to the Trellix Advanced Research Center, whose analysts found that a path traversal-related vulnerability, tracked as CVE-2007-4559, presently remains unpatched in more than 350,000 unique open source repositories, leaving software applications vulnerable to exploit.
In a blog post published Sept. 21, principal engineer and director of vulnerability research Douglas McKee said that the code base in question is present in software that spans a vast number of industries — primarily software development, artificial intelligence/machine learning, and code development, but also including sectors as diverse as security, IT management, and media.
The Python tarfile module ships as a default module in any project using Python, and is currently found extensively in frameworks created by AWS, Facebook, Google, Intel, and Netflix, as well as applications used for machine learning, automation, and Docker containerization, researchers said.
While the bug allows attackers to escape the directory that a file is supposed to be extracted to, actors can also exploit the flaw to execute malicious code, researchers said.
"Today, left unchecked, this vulnerability has been unintentionally added to hundreds of thousands of open- and closed-source projects worldwide, creating a substantial software supply chain attack surface," McKee said.
New Problem, Old Vulnerability
While recently investigating an enterprise device, Trellix researchers found that Python's tarfile module wasn't properly guarding against path traversal and thought they had stumbled across a new zero-day Python vulnerability, McKee wrote in the post. However, they soon realized that the flaw had already been discovered.
Further digging and later cooperation from GitHub revealed that there are about 2.87 million open source files that contain Python's tarfile module in about 588,000 unique repositories. Trellix's analysis found that about 61% of those instances are vulnerable, which led researchers to a current estimate of 350,000 vulnerable Python repositories.
In Open Source, There's No One to Blame
There are a number of reasons that the flaw has been able to spread throughout software unchecked for so long; however, it would be unfair to put specific blame on the Python project, various maintainers of the project, or any developers using the platform, McKee noted.
"Let's start by being explicitly clear — there is no one party, organization, or person to blame for the current state of CVE-2007-4559, but here we are anyway," he wrote.
Because open source projects like Python are run and maintained by a nebulous group of volunteers and not one federated organization — and in this case, a nonprofit foundation, to boot — it's harder to track and fix even known issues in a timely manner, McKee observed.
Further, "it is not uncommon for libraries or software development kits … to consider the responsibility for securely leveraging their APIs as part of the developer's responsibility," he said.
Indeed, Python has put a warning in its documentation of the tarfile function about the risks of using it, explicitly telling developers never to "extract archives from untrusted sources without prior inspection" due to the directory traversal issue.
While a warning is "a positive step" toward spreading awareness of the issue, it clearly hasn't prevented the vulnerability from being perpetuated, since it's still up to developers leveraging the code base to ensure that the software they build is secure, McKee observed.
He added that the problem is exacerbated by the fact that most Python tutorials on how to use the platform's modules (including Python's own documentation and popular sites like tutorialspoint, geeksforgeeks, and askpython.com) aren't clear on how to avoid insecure use of the tarfile module.
This discrepancy has allowed the vulnerability to be programmed into the supply chain, a trend that will likely continue for years to come unless there's broader awareness of the problem, McKee noted.
'Incredibly Easy' to Exploit the Flaw
On the technical front, CVE-2007-4559 is a path traversal vulnerability in Python's tarfile module that allows an attacker to overwrite arbitrary files by adding the ".." sequence to filenames in a TAR archive.
The actual flaw arises from two or three lines of code using unsanitized tarfile.extract() or the built-in defaults of tarfile.extractall(), noted Trellix vulnerability researcher Charles McFarland in a separate blog post on the problem published Wednesday.
"Failure to write any safety code to sanitize the members' files before calling tarfile.extract() or tarfile.extractall() results in a directory traversal vulnerability, enabling a bad actor access to the file system," he wrote.
For an attacker to take advantage of this vulnerability, they need to add ".." with the separator for the operating system ("/" or "\") into the file name to escape the directory the file is supposed to be extracted to, Trellix researcher Kasimir Schulz detailed. Python's tarfile module lets developers do exactly that, he noted.
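The kind of safety code McFarland describes can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Trellix's or the Python project's own fix: the helper names is_within_directory and safe_extract are hypothetical, and the sketch simply refuses any archive containing a member that would resolve outside the destination directory, along with any link member.

```python
import os
import tarfile


def is_within_directory(base_dir: str, target: str) -> bool:
    """Return True if target resolves to a location inside base_dir."""
    base = os.path.realpath(base_dir)
    resolved = os.path.realpath(target)
    return os.path.commonpath([base, resolved]) == base


def safe_extract(archive_path: str, dest_dir: str) -> None:
    """Extract an archive only if every member stays inside dest_dir.

    Hypothetical helper: all members are checked before anything is
    written, so a single bad entry blocks the whole archive.
    """
    with tarfile.open(archive_path) as tar:
        for member in tar.getmembers():
            target = os.path.join(dest_dir, member.name)
            if not is_within_directory(dest_dir, target):
                raise ValueError(f"blocked path traversal: {member.name!r}")
            # Links can point outside the destination as well, so this
            # sketch takes the conservative route and rejects them.
            if member.issym() or member.islnk():
                raise ValueError(f"blocked link member: {member.name!r}")
        tar.extractall(dest_dir)
```

Newer Python releases (3.12 and later) also accept a filter argument on extractall(), such as filter="data", which rejects many of the same dangerous members; where that is available, it is likely a better choice than hand-rolled checks like the one above.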
Schulz, a Trellix vulnerability research intern whose research on a separate issue brought the extensive Python tarfile bug to light, described in a third Trellix blog post published Wednesday how "incredibly easy" it is to exploit CVE-2007-4559.
Tarfiles in Python contain a collection of multiple different files and metadata that's later used to unarchive the tarfile itself, Schulz explained in his post. The metadata contained within a TAR archive includes but is not limited to information such as the file name, the size and checksum of the file, and information about the owner of the file when the file was archived.
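The metadata Schulz lists can be read back with the tarfile module itself. The following is a small sketch that builds a throwaway archive in memory and then reports each member's name, size, header checksum, and owner name; the function names, the file name docs/readme.txt, and the owner "alice" are made up for illustration.

```python
import io
import tarfile


def build_sample_archive() -> bytes:
    """Create a one-file TAR archive in memory (illustrative data only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        data = b"hello"
        info = tarfile.TarInfo(name="docs/readme.txt")
        info.size = len(data)
        info.uname = "alice"  # hypothetical owner name
        tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def describe_members(archive_bytes: bytes) -> list:
    """Return the per-file metadata stored in the archive headers."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r") as tar:
        return [
            {
                "name": member.name,        # path recorded in the header
                "size": member.size,        # file size in bytes
                "checksum": member.chksum,  # header checksum
                "owner": member.uname,      # owner at archive time
            }
            for member in tar.getmembers()
        ]
```

Note that the name field is whatever the archive's creator wrote into the header, which is why an extractor cannot assume it is a safe relative path.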
"The tarfile module lets users add a filter that can be used to parse and modify a file's metadata before it is added to the TAR archive," Schulz wrote. This enables attackers to create their exploits with as little as six lines of code, he said.
Schulz goes on in his post to explain in detail how he used the flaw and a custom-built script called Creosote — which searches through directories scanning for and then analyzing Python files — to execute malicious code within Spyder IDE, a free and open source scientific environment written for Python that can be run on Windows and macOS.
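The filter mechanism Schulz refers to is the filter argument of TarFile.add(), which receives each member's TarInfo before it is written. As a rough sketch for building a harmless test fixture (for example, to confirm that an extraction routine rejects traversal names), the code below rewrites a member's stored name and then only lists the archive's contents without extracting anything. The function names and file names are hypothetical, and this is not Schulz's own script.

```python
import os
import tarfile
import tempfile


def rename_member(tarinfo: tarfile.TarInfo) -> tarfile.TarInfo:
    """Filter for TarFile.add(): rewrite the name stored in the header."""
    tarinfo.name = "../" + tarinfo.name
    return tarinfo


def build_test_archive(workdir: str) -> str:
    """Write a small archive whose single member has a traversal name."""
    src = os.path.join(workdir, "payload.txt")
    with open(src, "w") as fh:
        fh.write("harmless test data\n")
    archive = os.path.join(workdir, "traversal_test.tar")
    with tarfile.open(archive, "w") as tar:
        # The filter runs on the TarInfo before it is added to the archive.
        tar.add(src, arcname="payload.txt", filter=rename_member)
    return archive


def list_names(archive: str) -> list:
    """List member names only; nothing is extracted."""
    with tarfile.open(archive) as tar:
        return tar.getnames()


if __name__ == "__main__":
    print(list_names(build_test_archive(tempfile.mkdtemp())))
```

Because the module stores the rewritten name as given, any code that later extracts such an archive has to validate member names itself.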
Spotlight on the Supply Chain
The tarfile issue once again highlights the software supply chain as an attack surface, one that has risen in prominence in recent years due to the broad impact attackers can have by targeting flawed code that's present across multiple platforms and thus enterprise environments. This can widen the impact of malicious campaigns without extra work on the part of threat actors.
There have been numerous examples already of what can happen across the supply chain in these types of attacks, with the now-infamous SolarWinds and Log4j scenarios being among the most prominent. The former started in late December 2020 with a breach in the SolarWinds Orion software and spread deep into the next year with multiple attacks across various organizations. The latter saga unfolded in early December 2021 with the discovery of a flaw dubbed Log4Shell in a widely used Java logging tool that spurred multiple exploits and made millions of applications vulnerable to attack, many of which remain unpatched today.
Lately, attackers have begun to see the benefit of going directly after open source code repositories to plant their own malicious code that can be exploited later for supply chain attacks. In fact, the Python project has found itself directly in the crosshairs.
In late August, attackers targeted users of the Python Package Index (PyPI) with the first-ever phishing attack against the registry, aimed at stealing users' credentials so threat actors could upload compromised packages to the repository. Earlier that month, PyPI had already removed 10 malicious packages from the registry after a security vendor warned that threat actors were embedding malicious code in package installation scripts.