Analytics // Security Monitoring
8/10/2012
08:07 PM

PubCrawl Detects Automated Abuse Of Websites

Academic researchers create a program to detect unwanted and malicious Web crawlers, blocking them from harvesting proprietary and sensitive data

A group of researchers from the University of California, Santa Barbara and Northeastern University has created a system called PubCrawl for detecting Web crawlers, even when the automated bots are coming from a distributed collection of Internet addresses.

The system combines multiple methods of discriminating between automated traffic and normal user requests, using both content and timing analysis to model traffic from a collection of IP addresses, the researchers stated in a paper to be presented at the USENIX Security Conference on Friday. Websites want to allow legitimate visitors to get the data they need from their pages, while blocking wholesale scraping of content by competitors, attackers and others who want to use the data for non-beneficial purposes, says Christopher Kruegel, an associate professor at UCSB and one of the authors of the paper.

"You want to make it easy for one person to get a small slice of the data," he says. "But you don't want to allow one person to get all the information."

Using data from a large, unnamed social network, the team trained the PubCrawl system to detect automated crawlers and then deployed the system to block unwanted traffic to a production server. The researchers had a high success rate: Crawlers were positively identified more than 95 percent of the time, with perfect detection of unauthorized crawlers and nearly 99 percent recognition of crawlers that masquerade as Web bots from a legitimate service.

A significant advance for crawler detection is recognizing the difference in traffic patterns between human visitors and Web bots, says Gregoire Jacob, a research scientist at UCSB and another co-author of the paper. By looking at the distribution of requests over time, the system can more accurately detect bots. When the researchers graphed a variety of traffic patterns, the differences became obvious, says Jacob.

"We realized that there is a fundamental difference," he says. "A crawler is a very stable signal -- it's almost a square signal. With a user, there is a lot of variation."
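The timing signal Jacob describes can be sketched in a few lines. This is an illustrative toy, not PubCrawl's actual algorithm: it measures how regular the gaps between requests are using the coefficient of variation (standard deviation divided by mean) of inter-request intervals. The function name and threshold are assumptions for illustration; a near-constant "square signal" yields a coefficient close to zero, while bursty human browsing yields a large one.

```python
# Toy discriminator for the "stable signal" observation: crawlers request
# pages at nearly constant intervals, humans in irregular bursts.
from statistics import mean, stdev

def looks_automated(request_times, cv_threshold=0.2):
    """Return True if inter-request timing is too regular to look human."""
    gaps = [b - a for a, b in zip(request_times, request_times[1:])]
    if len(gaps) < 2:
        return False  # not enough observations to judge
    cv = stdev(gaps) / mean(gaps)  # low CV => very stable, square-like signal
    return cv < cv_threshold

# A bot polling every 5 seconds vs. a human browsing in bursts:
bot_times = [0, 5, 10, 15, 20, 25]
human_times = [0, 2, 3, 40, 41, 95]
```

In practice a real system would model the full distribution of requests over time, as the paper describes, rather than a single summary statistic.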

[A Web-security firm launches a site for cataloging Web bots, the automated programs that crawl websites to index pages, grab competitive price information, gather information on social-networking users, or scan for vulnerabilities. See Gather Intelligence On Web Bots To Aid Defense.]

The researchers did not stop at using the signal patterns to improve the accuracy of their system. The team also tried to link similar patterns between disparate Internet sources that could indicate a distributed Web crawler. The PubCrawl system clusters Internet addresses that demonstrate similar traffic patterns into crawling campaigns.

Such distributed networks are the main threat to any attempt to prevent content scraping. PubCrawl can be set to allow a certain number of "free" requests per Internet address -- under that limit, no request is denied. Above that limit, the system attempts to identify the traffic pattern. Attacks that spread a very large number of low-bandwidth requests across many addresses could therefore escape notice.
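The free-request budget described above amounts to a simple per-address counter in front of the classifier. The limit, names, and return values below are hypothetical, chosen only to illustrate the two-tier policy.

```python
# Sketch of the "free request budget" policy: each address gets N
# unchallenged requests; beyond that, traffic is handed to pattern analysis.
from collections import Counter

FREE_LIMIT = 100           # illustrative; the real threshold is configurable
request_counts = Counter()

def handle_request(ip):
    request_counts[ip] += 1
    if request_counts[ip] <= FREE_LIMIT:
        return "allow"      # under the free budget: never denied
    return "analyze"        # over budget: run timing/content analysis
```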

"That is the limit of the detection: when attackers are able to mimic users in a distributed, non-regular fashion, it makes it difficult to catch," says UCSB's Kruegel. "But right now, attackers are very far from that."

For traffic above the minimum threshold that does not match any known pattern, the PubCrawl system uses an active countermeasure, forcing the visitor to solve an occasional CAPTCHA. Sources that request non-existent pages, fail to revisit pages, send odd referrer fields, or ignore cookies are all flagged as automated crawlers much more quickly.
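The behavioral red flags listed above suggest a simple weighted score; sources that accumulate enough flags could be challenged with a CAPTCHA sooner. The signal names, weights, and threshold below are assumptions for illustration, not values from the paper.

```python
# Hypothetical scoring of the behavioral red flags named in the article.
SUSPICIOUS_SIGNALS = {
    "requests_missing_pages": 2,  # asks for URLs that don't exist
    "never_revisits": 1,          # humans reload and navigate back
    "odd_referrer": 1,            # blank or inconsistent Referer header
    "ignores_cookies": 2,         # drops the session cookie on every request
}

def crawler_score(observed):
    """Sum the weights of the suspicious signals present in a source."""
    return sum(w for sig, w in SUSPICIOUS_SIGNALS.items() if observed.get(sig))

def should_challenge(observed, threshold=3):
    """Present a CAPTCHA once the score crosses the threshold."""
    return crawler_score(observed) >= threshold
```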

Much of this is not new to the industry, says Matthew Prince, CEO of Cloudflare, a website availability and security service. Companies such as Incapsula, Akamai, and Cloudflare have already created techniques to find and classify Web crawlers.

"We see a huge amount of traffic and are able to automatically classify most of the Web's bots and crawlers in order to better protect sites from bad bots while ensuring they're still accessible by good bots," Prince says.

Rival security firm Incapsula has noted the increase in automated Web traffic, which, in February, reached 51 percent of all traffic seen by websites. While 20 percent of Web requests come from search-engine indexers and other good bots, 31 percent come from competitive intelligence-gathering bots, site scrapers, comment spammers, and vulnerability scanners.

Yet with Web traffic set to increase five-fold by 2016, teasing out which traffic is good and which is bad will become more difficult, says Sumit Agarwal, vice president of product management for security start-up Shape Security.

"Being able to control your website while being open and accessible is going to be the biggest challenge for Web service firms in the future," he says.
