2/22/2012
01:51 PM

How Anonymous Are Your Online Posts?

Beware, flamebait-throwers, grammar police, and all-around trolls: A new algorithm can correctly identify an author 80% of the time, given enough source documents.

By applying "linguistic stylometry," a team of researchers from Stanford University and the University of California, Berkeley, has built an algorithm that can often match existing bodies of writing--for which the author is known--with anonymous postings. They plan to present the results of their research at the IEEE Symposium on Security and Privacy in May.

"Stylometric identification exploits the fact that we all have a 'fingerprint' based on our stylistic choices and idiosyncrasies with the written word," said report co-author Arvind Narayanan, a post-doctoral computer science researcher at Stanford, in a blog post. Interestingly, Narayanan's previous work has included studying how to break the anonymity of Netflix Prize data, as well as highlighting the difficulty of remaining anonymous on social networks.

In this case, by using linguistic stylometry, the researchers were able to correctly identify authors 20% of the time when analyzing a "corpus of texts from 100,000 authors" for which they had an average of 20 posts per person. "But it gets better from there," said Narayanan. "In 35% of cases, the correct author is one of the top 20 guesses. Why does this matter? Because in practice, algorithmic analysis probably won't be the only step in authorship recognition, and will instead be used to produce a shortlist for further investigation."
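The shortlist idea can be illustrated with a minimal Python sketch. This is not the researchers' algorithm: it assumes a deliberately tiny, hypothetical feature set (the relative frequency of ten common function words) and ranks candidates by cosine similarity. The names FUNCTION_WORDS, style_vector, and shortlist are invented for illustration.

```python
import math
from collections import Counter

# Hypothetical feature set: a handful of common English function words.
# Real stylometric systems use far richer features.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "is", "it", "but"]


def style_vector(text):
    """Relative frequency of each function word (naive whitespace tokenizing)."""
    words = text.lower().split()
    counts = Counter(words)
    total = max(len(words), 1)
    return [counts[w] / total for w in FUNCTION_WORDS]


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def shortlist(anonymous_text, known_texts, k=20):
    """Return up to k authors whose known writing is most similar in style.

    known_texts maps an author name to that author's combined known writing.
    """
    target = style_vector(anonymous_text)
    scored = [
        (cosine(target, style_vector(text)), author)
        for author, text in known_texts.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [author for _, author in scored[:k]]
```

A shortlist of 20 names out of 100,000 candidates is small enough for a human investigator to check by other means, which is why the top-20 figure matters more than the top-1 figure in practice.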


For example, he said, an author's location could add further context. Likewise, if a law enforcement agency required a service provider to disclose a subscriber's log-in and log-out times, they could compare those with the times that posts were made. Notably, that technique appears to have been used to help identify and bust an alleged LulzSec suspect.
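The timing comparison described above amounts to checking how many posts fall inside a subscriber's known sessions. The sketch below is an illustration only, assuming sessions are available as (log-in, log-out) pairs; the function name and data layout are hypothetical, and a high overlap is merely suggestive, not proof of authorship.

```python
from datetime import datetime


def fraction_of_posts_in_sessions(post_times, sessions):
    """Fraction of posts made while the subscriber was logged in.

    post_times: list of datetime objects, one per anonymous post.
    sessions: list of (login, logout) datetime pairs for one subscriber.
    """
    if not post_times:
        return 0.0
    inside = sum(
        any(login <= t <= logout for login, logout in sessions)
        for t in post_times
    )
    return inside / len(post_times)


# Example: one morning session, two posts (one during it, one after it).
sessions = [(datetime(2012, 1, 1, 9, 0), datetime(2012, 1, 1, 10, 0))]
posts = [datetime(2012, 1, 1, 9, 30), datetime(2012, 1, 1, 11, 0)]
overlap = fraction_of_posts_in_sessions(posts, sessions)  # 0.5
```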

The researchers also found that when they had more written words to draw from, their ability to correctly identify the author of an anonymous text improved noticeably. For example, when working with 40 to 50 attributed posts rather than just 20, researchers pushed their accuracy rate up to 35%. In addition, the researchers found that they could program their algorithm to return results only when it was confident that a match had been found. In such cases, "the algorithm does not always attempt to identify an author, but when it does, it finds the right author 80% of the time," Narayanan said.
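One simple way to make a classifier answer only when it is confident is to have it abstain unless the best candidate clearly beats the runner-up. The sketch below assumes that approach; the article does not say how the researchers measure confidence, and the function name and the min_margin threshold are hypothetical.

```python
def attribute_if_confident(scores, min_margin=0.1):
    """Return the top-scoring author only if it clearly beats the runner-up.

    scores: dict mapping candidate author -> similarity score (higher is closer).
    min_margin: hypothetical threshold on the gap between best and second best.
    Returns None (abstains) when there are no candidates or the gap is too small.
    """
    if not scores:
        return None
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if len(ranked) == 1:
        return ranked[0][0]  # only one candidate, nothing to compare against
    (best_author, best), (_, second) = ranked[0], ranked[1]
    return best_author if best - second >= min_margin else None
```

Raising min_margin trades coverage for precision: the function names an author less often, but is more likely to be right when it does.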

In other words, the days of practical online anonymity may be numbered, despite the right to anonymous free speech--online or otherwise--having been enshrined in U.S. law. As the Supreme Court wrote in a 1995 decision referenced by the researchers, "Anonymity is a shield from the tyranny of the majority ... It thus exemplifies the purpose behind the Bill of Rights, and of the First Amendment in particular: to protect unpopular individuals from retaliation ... at the hand of an intolerant society."

But there have been exceptions. To date, some legal requests to force service providers to reveal people's actual identities--typically, to put a subscriber name to an IP address--have been successful. In general, however, such identification has first required demonstrating that unlawful activity, such as defamation, occurred.

Unfortunately, advances in stylometric identification pose concerns for bloggers or whistleblowers who post anonymously to escape retribution, as well as for the sanctity of online anonymity and free speech in general. Indeed, if technology could be used to identify the authors of anonymous posts, then legal attempts to force service providers to unmask subscribers wouldn't be required.

Thankfully, the Stanford and Berkeley researchers said that online anonymity isn't set to disappear just yet. Notably, their approach isn't reliable unless there's a decent amount of text to analyze. That's true even when analyzing a piece of anonymous text for which there are only two possible authors.

Another limitation is that the researchers haven't yet analyzed whether people write differently depending on the medium. To date, they've only compared emails with emails, and blogs with blogs. People's writing style, however, may differ when writing a blog as opposed to an email. As a result, it might be difficult to attribute anonymous emails to an author for whom researchers had only blog posts.

But perhaps the biggest limitation is that "the attack is unlikely to work if the victim intentionally obfuscates their writing style," they said. In other words, anyone who wants to remain anonymous can proactively vary their writing style, swapping word order or hitting the thesaurus to select synonyms they might not otherwise use.

Interestingly, this isn't the first time that computer scientists have attempted to use statistical textual analysis to identify authors. Numerous scholars have subjected Shakespeare’s plays--or as some might say, "the plays attributed to Shakespeare"--to a rigorous statistical analysis, seeking clues as to whether the works may instead have been authored or co-authored by one of The Bard's contemporaries.

Those studies are limited by researchers needing to have enough source material--plays, poems, and letters, for example--from other potential authors to produce statistically significant results. But in the online realm, users of blogs, Twitter, Facebook, and other social networks continue to generate an ever-greater quantity of publicly accessible words written with their own particular linguistic fingerprints.

