Researchers Hunt Sources Of Viruses, Memes

Swiss university researchers propose a method for tracing biological infections back to their source from incomplete data, a technique that could also work for digital viruses and informational memes.
A mathematical model using sparse clues to locate the source of a biological epidemic could give data scientists tools to find the origin of rumors spread over social networks or the initial compromise of an online attack.

In a paper (PDF) published in this month's Physical Review Letters, Pedro Pinto and two colleagues at the Swiss Federal Institute of Technology in Lausanne (EPFL) found that a relatively small number of observers, or sensors, could gather enough information to determine the source of an infection with 90 percent probability. By understanding the diffusion process -- whether of viruses, chemicals, or information -- researchers could use the model to estimate the number of sensors needed to find the source of a variety of epidemic-like processes, says Pinto, a postdoctoral researcher at EPFL.

"Other papers look at knowing everything at every individual node, but we put on a practical constraint: that you only know information from a limited number of sensors," he says.

The researchers investigated four different types of network graphs, finding that under the best circumstances -- choosing highly connected nodes as observers -- only 4 percent of nodes need to be monitored to have a 90 percent chance of tracing the infection back to its source. For a random selection of nodes, the proportion of observers required is much higher -- up to 49 percent in the worst case.
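The core idea can be sketched in a few lines of Python. This is a toy illustration, not the EPFL algorithm itself (which also models random propagation delays on each edge): given arrival times recorded at a handful of observer nodes, score every candidate source by how well its hop distances explain the *relative* arrival times at the observers, and pick the best fit. The graph and observations below are invented for the example.

```python
from collections import deque

# Hypothetical toy network: adjacency list of a small connected graph.
GRAPH = {
    0: [1, 2], 1: [0, 3], 2: [0, 3, 4],
    3: [1, 2, 5], 4: [2, 5], 5: [3, 4],
}

def hop_distances(graph, start):
    """BFS hop counts from `start` to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def estimate_source(graph, observations):
    """Pick the candidate node whose hop distances best match the
    relative infection-arrival times seen at the observer nodes.
    `observations` is a list of (observer_node, arrival_time) pairs.
    Assumes a connected graph and roughly unit spreading delay per hop."""
    t0 = min(t for _, t in observations)  # earliest observed arrival
    best, best_err = None, float("inf")
    for candidate in graph:
        dist = hop_distances(graph, candidate)
        d0 = min(dist[n] for n, _ in observations)
        # Squared mismatch between observed and predicted arrival offsets.
        err = sum(((t - t0) - (dist[n] - d0)) ** 2 for n, t in observations)
        if err < best_err:
            best, best_err = candidate, err
    return best

# Infection started at node 0; only three observers reported times.
print(estimate_source(GRAPH, [(1, 1), (4, 2), (5, 3)]))  # -> 0
```

With only three of six nodes instrumented, the timing offsets are enough to single out node 0, echoing the paper's point that sparse, well-placed sensors can suffice.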

Other researchers called into question how applicable the research would be to the digital world. While the EPFL researchers verified their technique using real-world data on a cholera outbreak in South Africa -- finding they could get within four hops of the actual source -- the digital world is a different beast, says Stefan Savage, a professor of computer science at the University of California at San Diego.

"The basic idea here is reasonable -- trying to infer origins by reversing the dynamics of the spreading process -- although not totally new in a cybercontext," Savage says. "But this is one of those cases where the devil is in the details: What can you actually observe, where are there, in fact, strong topological dependencies, and how hard is it for the adversary to mask their origins?"

Even EPFL's Pinto acknowledges that the technique relies on how well the specific environment can be modeled. Each case has to be modeled as a tree or graph of nodes and -- while the researchers only depended on the timing of infection -- other information could be taken into account as well. Pinto is evaluating cases involving Internet security that could be the focus of further research.

"Each application requires us to tweak the model," he says. "There are always little details that are different for each case."


For example, in the case of the South African cholera outbreak, each node was a human community and associated water reservoir, while the edges of the graph were waterways and other means of spreading the outbreak.

Modeling computer networks as connected graphs is much easier, and timing data is generally much more precise -- at least in local-area networks, says Richard Bejtlich, chief security officer for security services firm Mandiant.

"This is one of the few areas where the digital world has an advantage over the physical world," Bejtlich says. "For example, we put our software in an enterprise, and we sweep the network -- we can ask questions and get answers back. To do that in a human population, we would have to ask everyone to submit to blood tests."

Finally, tracing outbreaks of digital worms and viruses back to their sources is not a new subject of security research. In 2005, a group of researchers from Carnegie Mellon University found that retracing a worm's trail (PDF) could be done using random walks backward through traffic records to the source. Yet even that approach assumed the availability of complete data, which may not be the case in practice.
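That style of traceback can be sketched as follows. This is a toy reconstruction, not the CMU algorithm itself (which samples walks over massive real traffic logs): repeatedly pick a flow record, walk backward in time along flows into the current host, and count where the walks terminate; the most frequent endpoint is the estimated origin. The flow log below is invented for the example, with one unrelated "noise" flow mixed in.

```python
import random
from collections import Counter

# Hypothetical flow log: (timestamp, source_host, dest_host) records.
# Host "A" seeds the infection; one record is unrelated background traffic.
FLOWS = [
    (1, "A", "B"), (2, "A", "C"), (3, "B", "D"),
    (4, "C", "E"), (5, "D", "F"),
    (2, "E", "B"),  # noise flow, not part of the infection tree
]

def walk_back(flows, start):
    """Walk backward in time from `start`, repeatedly hopping to a random
    earlier flow that terminated at the current host. Returns the host
    where the walk dead-ends."""
    t, src, _ = start
    while True:
        preds = [f for f in flows if f[2] == src and f[0] < t]
        if not preds:
            return src
        t, src, _ = random.choice(preds)

def estimate_origin(flows, n_walks=200):
    """Run many backward walks from random starting flows and return the
    host the walks most often converge on."""
    counts = Counter(
        walk_back(flows, random.choice(flows)) for _ in range(n_walks)
    )
    return counts.most_common(1)[0][0]

random.seed(0)
print(estimate_origin(FLOWS))  # the walks overwhelmingly converge on "A"
```

Because every infection chain ultimately leads back through "A", most walks pile up there even though individual walks are random and the log contains noise, which is why the approach degrades when large portions of the traffic record are missing.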

"It is likely that traffic auditing will be deployed incrementally across different networks," stated the paper. "We are investigating the impact of missing data on performance."
