Analytics // Threat Intelligence
8/15/2013
11:35 PM

Researchers Seek Better Ways To Track Malware's Family Tree

Following a program's evolution back to the author may not yet be a reality, but computer scientists are searching for more accurate measures of the relationships between software versions

Using basic features of software programs, researchers from Carnegie Mellon University have been able to organize related code into family trees, connecting initial versions to subsequent updates by using techniques that could allow malware analysts to more quickly triage unknown threats.

In a paper presented at the USENIX Security Conference this week, the three researchers -- Jiyong Jang, Maverick Woo, and David Brumley of Carnegie Mellon University -- used software features to graph the relationships between program code and track the evolution of benign and malicious software programs. The researchers created both a program, known as iLINE, to construct trees and graphs of related program versions, and a system, known as iEVAL, to evaluate the correctness of their family trees.

The researchers found that even basic measures -- such as file size and similarities among code snippets -- can be used to organize software by its lineage, from a root program to subsequent versions. For benign programs, using file and section size yielded family trees that were about 95 percent accurate, while using similarities in small sections of code yielded up to 96 percent accuracy.
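
As a rough illustration of how similarity among code snippets might be turned into a lineage, the Python sketch below starts from a known root and repeatedly appends the remaining version that most resembles the latest one. It is a minimal sketch under stated assumptions, not the researchers' iLINE implementation: the byte n-gram features, the Jaccard measure, the greedy ordering, and the names `ngrams`, `jaccard`, and `infer_lineage` are all hypothetical choices made for this example.

```python
from typing import Dict, List, Set


def ngrams(data: bytes, n: int = 4) -> Set[bytes]:
    """Return the set of overlapping n-byte snippets in a binary blob."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def jaccard(a: Set[bytes], b: Set[bytes]) -> float:
    """Jaccard similarity: shared snippets divided by all snippets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def infer_lineage(versions: Dict[str, bytes], root: str, n: int = 4) -> List[str]:
    """Greedily order versions into a straight line, starting at the root.

    Assumption for this sketch: each release is most similar to its
    immediate predecessor, so the nearest unplaced version comes next.
    """
    features = {name: ngrams(blob, n) for name, blob in versions.items()}
    lineage = [root]
    remaining = sorted(set(versions) - {root})
    while remaining:
        latest = features[lineage[-1]]
        best = max(remaining, key=lambda name: jaccard(latest, features[name]))
        lineage.append(best)
        remaining.remove(best)
    return lineage


# Toy "binaries" (made up): each version appends bytes to the previous one.
versions = {
    "v1": b"AAAABBBB",
    "v2": b"AAAABBBBCCCC",
    "v3": b"AAAABBBBCCCCDDDD",
    "v4": b"AAAABBBBCCCCDDDDEEEE",
}
print(infer_lineage(versions, root="v1"))  # ['v1', 'v2', 'v3', 'v4']
```

Real binaries would need sturdier features than raw byte n-grams, and a straight-line ordering cannot represent branches or merges, so this shows only the general shape of the idea.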

"In the beginning, we thought constructing a lineage would be impossible," says Woo, a system scientist at Carnegie Mellon's CyLab. "But after working on it, we found that even simple techniques worked quite well."

Classifying malware into families is a necessary step for security-software firms to reduce the workload on their analysts. By creating a single signature or pattern that matches every variant in a family, antivirus firms can also build more efficient software.

Yet after classifying malware into a group, the next step is to determine which program should represent the family of malware, says Jiyong Jang, a co-author of the paper who just completed his PhD in electrical and computer engineering at CMU. The parent of all the programs, known as the root, is a good choice, he says.

"If you have constructed a lineage, you can determine the root and study that program," Jang says.

A major attraction of tracing the history of a malware family tree is the hope that it might lead to the root -- the developer who created the malware, says Jason Lewis, chief scientist with Lookingglass Cyber Solutions, a threat-intelligence firm.

"I think the usefulness here is trying to determine who is writing the code," he says, adding that attributing code to a specific group of actors -- even if their identities are unknown -- can still be useful. "Is this someone who is interested just in stealing money and bank account information, or is this someone who is more of a state-level actor?"


Attribution is typically a labor-intensive manual process. The automated techniques shown off by the CMU researchers could allow malware analysts to look for complex code structures or algorithms that indicate a developer with more technical chops than the average online bank thief, Lewis says.

Yet any such analysis has to be careful not to draw too many conclusions from tenuous relationships, says Joe Stewart, director of malware research for Dell Secureworks, a managed security service provider. Two similar sets of code may not indicate a single author, but could mean that two developers collaborated, one author copied code from another, or that the two projects both used a third common library.

The researchers noted other pitfalls as well. Incorrectly guessing the root program, for example, can nearly halve the accuracy of the resulting family tree. And while the size of each section of code was a good way to match up malicious code, file size was not.
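
To make the idea of lineage accuracy concrete, the Python sketch below scores an inferred ordering against a known release order by counting how many pairs of versions appear in the correct relative order. This is an assumed, simplified metric chosen for illustration, not necessarily the one iEVAL uses. The function name `pairwise_accuracy` and the toy orderings are hypothetical, and the numbers are not intended to reproduce the paper's results.

```python
from itertools import combinations
from typing import List


def pairwise_accuracy(inferred: List[str], truth: List[str]) -> float:
    """Fraction of version pairs placed in the same relative order as the truth."""
    rank = {name: i for i, name in enumerate(inferred)}
    pairs = list(combinations(truth, 2))  # (earlier, later) in the true order
    if not pairs:
        return 1.0
    correct = sum(1 for earlier, later in pairs if rank[earlier] < rank[later])
    return correct / len(pairs)


truth = ["v1", "v2", "v3", "v4", "v5"]

# Hypothetical outputs of a lineage tool: one run seeded with the true root,
# one seeded with a mistaken root (v3), which reverses the early history.
good_root = ["v1", "v2", "v3", "v4", "v5"]
wrong_root = ["v3", "v2", "v1", "v4", "v5"]

print(pairwise_accuracy(good_root, truth))   # 1.0
print(pairwise_accuracy(wrong_root, truth))  # 0.7
```

In this toy case, a single wrong guess at the root misorders three of the ten version pairs. That gives a sense of why the choice of root matters so much to the final tree.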

In the end, the techniques could prove useful if used to highlight interesting similarities between code so that analysts can follow up on them, Stewart says.

"You have to be careful what you infer out of [these analyses] ... but it could lead you to connections that you might not see otherwise," Stewart says. "Having something that automates the matching of software features is a good piece of the [analysis] puzzle."

Woo and Jang aim to improve their system and identify more interesting features that can be automatically extracted from program binaries, at runtime, and from source code. While current methods based on basic features do quite well, the aim is to create a more robust method of inferring the relationships between code and to exceed 95 percent accuracy.

"We are all competing on the last 5 percent, so we need to construct more meaningful features," Woo says.

Robert Lemos is a veteran technology journalist of more than 16 years and a former research engineer, writing articles that have appeared in Business Week, CIO Magazine, CNET News.com, Computing Japan, CSO Magazine, Dark Reading, eWEEK, InfoWorld, MIT's Technology Review, ...
