Academic researchers create a program to detect unwanted and malicious Web crawlers, blocking them from harvesting proprietary and sensitive data

Dark Reading Staff, Dark Reading

August 10, 2012

4 Min Read

A group of researchers from the University of California, Santa Barbara, and Northeastern University has created a system called PubCrawl for detecting Web crawlers, even when the automated bots come from a distributed collection of Internet addresses.

The system combines multiple methods of discriminating between automated traffic and normal user requests, using both content and timing analysis to model traffic from a collection of IP addresses, the researchers stated in a paper to be presented at the USENIX Security Conference on Friday. Websites want to allow legitimate visitors to get the data they need from their pages, while blocking wholesale scraping of content by competitors, attackers and others who want to use the data for non-beneficial purposes, says Christopher Kruegel, an associate professor at UCSB and one of the authors of the paper.

"You want to make it easy for one person to get a small slice of the data," he says. "But you don't want to allow one person to get all the information."

Using data from a large, unnamed social network, the team trained the PubCrawl system to detect automated crawlers and then deployed the system to block unwanted traffic to a production server. The researchers had a high success rate: Crawlers were positively identified more than 95 percent of the time, with perfect detection of unauthorized crawlers and nearly 99 percent recognition of crawlers that masquerade as Web bots from a legitimate service.

A significant advance for crawler detection is recognizing the difference in traffic patterns between human visitors and Web bots, says Gregoire Jacob, a research scientist at UCSB and another co-author of the paper. By looking at the distribution of requests over time, the system can detect bots more accurately. When the researchers graphed a variety of traffic patterns, the differences became obvious, says Jacob.

"We realized that there is a fundamental difference," he says. "A crawler is a very stable signal -- it's almost a square signal. With a user, there is a lot of variation."

[A Web-security firm launches a site for cataloging Web bots, the automated programs that crawl websites to index pages, grab competitive price information, gather information on social-networking users, or scan for vulnerabilities. See Gather Intelligence On Web Bots To Aid Defense.]

The researchers did not stop at using the signal patterns to improve the accuracy of their system. The team also tried to link similar patterns between disparate Internet sources that could indicate a distributed Web crawler. The PubCrawl system clusters Internet addresses that demonstrate similar traffic patterns into crawling campaigns.
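One way to picture that clustering step, again as an illustrative sketch rather than the paper's actual algorithm, is to compute the same per-bin rate signal for every source address and group addresses whose signals are nearly identical:

```python
import math

def similarity(sig_a, sig_b):
    """Cosine similarity between two equal-length rate signals."""
    dot = sum(a * b for a, b in zip(sig_a, sig_b))
    norm = math.sqrt(sum(a * a for a in sig_a)) * math.sqrt(sum(b * b for b in sig_b))
    return dot / norm if norm else 0.0

def group_into_campaigns(signals, threshold=0.9):
    """Greedily cluster addresses whose rate signals nearly match.
    'signals' maps ip -> list of per-bin request counts, all aligned
    to the same time bins. Returns a list of address groups."""
    campaigns = []  # each entry: (representative_signal, [ips])
    for ip, sig in signals.items():
        for rep, members in campaigns:
            if similarity(rep, sig) >= threshold:
                members.append(ip)
                break
        else:
            campaigns.append((sig, [ip]))
    return [members for _, members in campaigns]
```

Addresses that land in the same group become candidates for a single, coordinated crawling campaign, even if no individual address stands out on its own.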

Such distributed networks are the main threat to any attempt to prevent content scraping. PubCrawl can be set to allow a certain number of "free" requests per Internet address -- under that limit, no request will be denied. Above that limit, the system attempts to identify the traffic pattern. Attacks that use a very large number of low-bandwidth requests could escape notice.
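The gating logic is simple to sketch; the allowance value and names below are illustrative assumptions, not figures from the paper:

```python
FREE_REQUESTS_PER_DAY = 500  # illustrative allowance, not PubCrawl's actual value

def handle_request(ip, request_counts, classify_traffic):
    """Only sources that exceed the free allowance are subjected to
    the heavier traffic-pattern classification step."""
    request_counts[ip] = request_counts.get(ip, 0) + 1
    if request_counts[ip] <= FREE_REQUESTS_PER_DAY:
        return "allow"              # under the limit: never denied
    return classify_traffic(ip)     # over the limit: analyze the pattern
```

A distributed crawler that keeps every one of its addresses under the allowance never reaches the classification step at all, which is the weakness the researchers acknowledge.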

"That is the limit of the detection, when attackers are able to mimic users in a distributed non-regular fashion, it make it difficult to catch," says UCSB's Kruegel. "But right now, attackers are very far from that."

For traffic above the minimum threshold that does not match any known pattern, the PubCrawl system uses an active countermeasure, forcing the user to solve the occasional CAPTCHA. Sources of requests that ask for non-existent pages, fail to revisit pages, have odd referrer fields, and ignore cookies will all be flagged as automated crawlers much more quickly.
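Those secondary signals lend themselves to a simple scoring pass. The summary fields and thresholds below are hypothetical -- the paper's heuristics are more involved -- but they capture the kinds of checks described:

```python
def heuristic_flags(session):
    """Count secondary crawler signals for one source. 'session' is a
    hypothetical per-source summary dict, e.g.
    {'requests': 1200, 'not_found': 300, 'revisits': 0,
     'blank_referrers': 1150, 'cookies_returned': 0}."""
    flags = 0
    if session["not_found"] / max(session["requests"], 1) > 0.05:
        flags += 1   # asks for many non-existent pages
    if session["revisits"] == 0 and session["requests"] > 50:
        flags += 1   # never revisits a page
    if session["blank_referrers"] / max(session["requests"], 1) > 0.9:
        flags += 1   # missing or odd referrer fields
    if session["cookies_returned"] == 0:
        flags += 1   # ignores cookies set by the server
    return flags     # more flags -> escalate (e.g., to a CAPTCHA) sooner
```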

Much of this is not new to the industry, says Matthew Prince, CEO of Cloudflare, a website availability and security service. Companies such as Incapsula, Akamai, and Cloudflare have already created techniques to find and classify Web crawlers.

"We see a huge amount of traffic and are able to automatically classify most of the Web's bots and crawlers in order to better protect sites from bad bots while ensuring they're still are accessible by good bots," Prince says.

Rival security firm Incapsula has noted the increase in automated Web traffic, which, in February, reached 51 percent of all traffic seen by websites. While 20 percent of Web requests come from search engine indexers and other good bots, 31 percent come from competitors' intelligence-gathering bots, as well as site scrapers, comment spammers, and vulnerability scanners.

Yet with Web traffic set to increase five-fold by 2016, teasing out which traffic is good and which is bad will become more difficult, says Sumit Agarwal, vice president of product management for security start-up Shape Security.

"Being able to control your website while being open and accessible is going to be the biggest challenge for Web service firms in the future," he says.
