Perimeter | Commentary
10/10/2018 10:30 AM
Kaan Onarlioglu

Security Researchers Struggle with Bot Management Programs

Bots are a known problem, but researchers will tell you that bot defenses create problems of their own when it comes to valuable data.

Bot management is all the rage in the security world. Every day, I find myself bombarded with articles proclaiming that N percent of Internet traffic is generated by bots, where N is a sufficiently alarming number to make most executives want to dash out and purchase the first bot-defense product in sight. While I can't speak for the accuracy of those reports, one thing's certain: There's a growing demand for effective bot mitigation.

I know. I work for a company that develops one such bot management solution, and I talk to customers about it daily. I do enjoy having some semblance of job security, but being the recovering academic that I am, I'm also really concerned. Conducting large-scale Internet crawls is an all too common task in many fields of security research. Does the research community fully understand the implications of bot defenses on their experiments? Do they do anything about it? I am not optimistic.

"Bot" is a notoriously overloaded term. Today, it is generally understood to mean any software that performs automated tasks over the Internet. This includes malware such as the agents that make up a botnet, but also benign software like search engine crawlers and information aggregators. Conveniently, this definition aligns with the features of popular bot management solutions; businesses certainly want malware protection, but they also have strong incentives to monitor, limit, block, or even serve false content to the automated requests reaching their web properties.

This is a serious problem for security researchers.

Data collection via Internet crawls is a crucial part of security research. In my own work, I crawled millions of websites and scraped application stores, code repositories, forums, vulnerability databases, and more. Think about it. Researchers meticulously design experiments, build and analyze invaluable data sets in a scientific framework, and (sometimes literally) fight to publish and present their results at prestigious conferences, only to discover that their data set was tainted by a plethora of bot defenses scattered around the Internet.

In the best case, the collected data is merely biased: servers equipped with bot defenses block the connection or return a static page without meaningful content. In the worst case, servers that return false information to thwart information harvesters can make it nigh impossible to even detect that something somewhere went wrong.
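To make that first failure mode a bit more concrete, a crawler can at least run cheap sanity checks on every response before storing it. The Python sketch below is purely illustrative: the helper name, the status codes, the length threshold, and the block-page keywords are my own assumptions rather than a reliable detector, and it offers no help against convincingly faked content.

# Minimal sanity checks for a single crawl response (thresholds and keywords are
# illustrative assumptions, not a robust bot-defense detector).

BLOCK_HINTS = ("captcha", "access denied", "unusual traffic", "are you a robot")

def looks_blocked(status_code: int, body: str) -> bool:
    """Heuristically flag responses that are probably bot-defense artifacts."""
    if status_code in (403, 429, 503):   # common "go away" status codes
        return True
    if len(body) < 512:                  # suspiciously small page (arbitrary threshold)
        return True
    lowered = body.lower()
    return any(hint in lowered for hint in BLOCK_HINTS)

# Example usage: only keep responses that pass the checks.
# if not looks_blocked(status, html):
#     dataset.append({"url": url, "status": status, "body": html})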

I have no reason to doubt that this situation already significantly affects Internet crawls and measurement studies. In all likelihood, we regularly work with bad data, and then publish and read papers with skewed results. But we simply don't yet have insight into how data collection is affected by bot defenses.

A solution is not likely to come from the business side. Widespread adoption of bot defenses won't be tapering off anytime soon. There simply isn't enough motivation for businesses to back down from their strong stance against bots; they won't forgo protection to accommodate a few innocuous crawlers among myriad malicious hits.

As far as researchers are concerned, there's always been a certain degree of awareness of anti-crawling techniques. Researchers came up with best practices such as crafting realistic request headers, limiting connection rates, and building crawlers on headless browsers. However, modern bot defenses are well-prepared to catch these tricks; they analyze browser characteristics, connection patterns, packet structure, and even hardware inputs, and combine these observations in nontrivial ways to distinguish between humans and our robot overlords.
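For context, those "classic" best practices amount to something like the minimal sketch below: a crawler that sends realistic browser-like headers and throttles itself. It assumes the third-party Python requests package; the function name, header values, and two-second delay are arbitrary illustrations rather than recommendations, and, as noted above, modern defenses routinely see through exactly this kind of client.

import time

import requests  # assumes the third-party 'requests' package is installed

# Headers mimicking a common desktop browser (values are illustrative assumptions).
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
}

def polite_fetch(urls, delay_seconds=2.0):
    """Fetch pages sequentially with browser-like headers and a fixed delay."""
    session = requests.Session()
    session.headers.update(HEADERS)
    pages = {}
    for url in urls:
        try:
            resp = session.get(url, timeout=10)
            pages[url] = (resp.status_code, resp.text)
        except requests.RequestException as exc:
            pages[url] = (None, str(exc))
        time.sleep(delay_seconds)  # crude rate limiting between requests
    return pages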

Yes, even the most intricate defense can be reverse-engineered and bypassed given enough resources and dedication. The bar, however, is high. Faced with a growing number of evolving bot management products, researchers are perpetually at a disadvantage.

The Need for Change
We need a paradigm shift. Here is an idea: The next time we run a crawl, let's acknowledge that the entire Internet is out there to corrupt our data, and duly deal with it! Data validation is key. Questionable data collection methodologies and low-quality data sets aren't exactly unknown territory for the research community, but we need even greater focus on this issue today.

I'm all too familiar with that urge to rush through data collection and get to the more interesting data analysis (and then submit a half-decent paper minutes before a deadline). This approach is missing the mark if it leads to inaccurate measurements and incorrect conclusions.

Data validation is a hard problem, but at the same time it's a well-explored area of computer science. We have the necessary tools, like constraint validation for predictable data, or clustering to spot outliers in complex data sets. When all else fails, manual analysis combined with sampling can be a surprisingly effective and viable approach, even for extremely large data sets. It's well worth putting in the extra time and effort to systematically validate data, in addition to writing at length about the process in publications, so that the reviewers and readers know we did our part.
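As a rough illustration of what that validation can look like in practice, the Python sketch below combines simple constraint checks with an outlier rule based on median absolute deviation, a lighter-weight stand-in for the clustering mentioned above. The record format, the function name, the constraints, and the threshold are all assumptions made for the example.

from statistics import median

def validate_records(records):
    """Split crawl records into plausible entries and candidates for manual review.

    Each record is assumed to look like:
    {"url": ..., "status": 200, "body": "<html>...</html>"}
    """
    passed, review = [], []

    # Constraint validation: basic invariants every usable record should satisfy.
    for rec in records:
        if rec.get("status") == 200 and rec.get("body"):
            passed.append(rec)
        else:
            review.append(rec)

    if not passed:
        return passed, review

    # Outlier detection: flag pages whose body length deviates wildly from the
    # median, measured in median absolute deviations (the threshold 5 is arbitrary).
    lengths = [len(rec["body"]) for rec in passed]
    med = median(lengths)
    mad = median(abs(l - med) for l in lengths) or 1.0

    plausible = []
    for rec, length in zip(passed, lengths):
        if abs(length - med) / mad < 5:
            plausible.append(rec)
        else:
            review.append(rec)

    return plausible, review  # sample 'review' manually before trusting the data set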

Finally, I'll point out that this problem has an interesting beneficial side effect: the potential to open up unique research directions. Enabling functional yet ethical crawling techniques that are also aligned with businesses' needs is one obvious route this can take. However, I also anticipate novel techniques that can scientifically quantify the impact of bot defenses on measurements.

With better insights and visibility into this issue, we can better recognize our limitations, and pursue the promising paths toward a solution.

Kaan Onarlioglu is a researcher and engineer at Akamai who is interested in a wide array of systems security problems, with an emphasis on designing practical technologies with real-life impact. He works to make computers and the Internet secure.