Dark Reading is part of the Informa Tech Division of Informa PLC



5/21/2020
04:55 PM

Web Scrapers Have Bigger-Than-Perceived Impact on Digital Businesses

The economic impact of bot traffic can be unexpectedly substantial, a PerimeterX-commissioned study finds.

Automated bots that collect content, product descriptions, pricing, inventory data, and other public-facing information from websites have a greater economic and performance impact than many organizations might realize, a new study suggests.

Bot mitigation company PerimeterX recently commissioned market intelligence firm Aberdeen Group to look into how web-scraping bots might be affecting the revenues of digital businesses.

The study found that bots account for between 40% and 60% of total website traffic in certain industries and can affect businesses in multiple ways, including overloading their infrastructure, skewing analytics data, and diminishing the value of their IP, marketing, and SEO investments. The impact on revenue from such factors is considerable, according to PerimeterX.

"Web scraping hurts your revenue in more ways than you know," says Deepak Patel, security evangelist at PerimeterX. For the e-commerce sector, website scraping can dilute overall annual website profitability by as much as 80%, the study shows.

"For the media sector, the median annual business impact of website scraping is as much as 27% of overall website profitability," Patel adds.

Many organizations don't view web-scraping bots as a security threat because they don't breach the network or exploit a security flaw. However, they do pose a big threat to business logic or proprietary content essential for maintaining a competitive edge.

"Malicious web-scraping bots can steal your exclusive, copyrighted content and images," says Patel, adding that scraping can also damage a site's SEO rankings when search engines detect pages with duplicate content.

Organizations routinely use web scrapers to look up information on their competition, to build services based on third-party data, or for a variety of other reasons. The bots scour websites, in much the same way search engine crawlers do, and collect any information that the site operator has publicly posted and that would be useful to the organization using the bots.

Though there are some questions over the legality of the practice, numerous products and services are available that allow organizations to scrape another firm's website for information that is available publicly. In a lawsuit involving talent management advisory firm hiQ Labs and LinkedIn, the Ninth Circuit Court of Appeals last year held that the scraping of publicly available data does not violate US computer fraud laws. LinkedIn had wanted hiQ to stop scraping publicly available data from its site, which the latter was using to create analytics tools to help companies deal with employee retention issues.

"As a technical matter, web scraping is simply machine-automated web browsing and accesses and records the same information, which a human visitor to the site might do manually," the Electronic Frontier Foundation had noted in welcoming the appellate court's decision.

Bad Bots
The study shows that while humans and "good bots," such as those used by search engines, represented a substantial proportion of web traffic, "bad bots" represented a significant proportion as well. Nearly 17% of all traffic on e-commerce websites, for example, consisted of bad bots. On travel sites, the proportion was closer to 31%, and on media sites it was around 9.5%.

Patel says bad bots are bots that crawl websites to perform abusive or malicious actions, including account takeover and content plagiarism. Such bots often mimic human behavior and use multiple IPs to evade detection.

They also can scrape content that other sites might have invested in substantially to develop — like SEO-optimized product descriptions or marketing content, for instance. For companies that are doing the scraping, such content can help reduce or even eliminate the need to develop their own content. Conversely, for digital businesses that are the targets, web scraping can potentially erode the value of their investments, the study found. Similarly, information that companies need to put on their sites — like pricing information or product availability — could help rivals gain valuable insight for making their own decisions.

Bot traffic can also overload web infrastructure by sending millions of requests to a specific path, such as login or checkout pages, causing a slowdown for users, Patel says. According to him, 80% of account logins originate from bad bots.
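The concentration of requests on a single path that Patel describes can be illustrated with a minimal sketch. It assumes a hypothetical log of (client, path) pairs; the function names and the 50% threshold are illustrative assumptions, not figures or methods from the study.

```python
from collections import Counter


def path_share(requests):
    """Return each path's share of total requests as a fraction.

    `requests` is a hypothetical list of (client, path) tuples.
    """
    counts = Counter(path for _client, path in requests)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {path: n / total for path, n in counts.items()}


def hot_paths(requests, threshold=0.5):
    """List paths whose share of traffic meets an assumed threshold."""
    shares = path_share(requests)
    return sorted(p for p, share in shares.items() if share >= threshold)


# Made-up sample traffic: 6 login hits, 3 product hits, 1 checkout hit.
sample = (
    [("10.0.0.1", "/login")] * 6
    + [("10.0.0.2", "/products")] * 3
    + [("10.0.0.3", "/checkout")]
)

print(hot_paths(sample))
```

A share-based check like this only shows where traffic is concentrated. It does not by itself distinguish bots from a legitimate surge of human visitors.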

"Scraping bots can significantly impact website performance since they have to collect a lot of data quickly," Patel says. On retail sites, for example, the traffic from bots trying to keep pace with new product listings or pricing changes can degrade performance.

Many commercially available tools are designed to help digital businesses deal with web scrapers.

"But today's bots, unlike more crude, basic bots of the past, are becoming more adept at mimicking actual users and disguising their true purpose," Patel says. "Hyper-distributed scraping attacks, achieved by using many different user agents, IPs, and [autonomous system numbers] are even more dangerous, resulting in higher volume and higher difficulty of detection."
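A small sketch can show why the distributed traffic Patel describes is harder to detect. It assumes a hypothetical log of (IP, user agent) pairs and a naive per-IP request limit; the addresses, agent strings, and limit are made up for illustration and do not represent any vendor's detection method.

```python
from collections import Counter


def flag_by_ip(requests, per_ip_limit):
    """Return IPs whose request count exceeds an assumed per-IP limit.

    `requests` is a hypothetical list of (ip, user_agent) tuples.
    """
    counts = Counter(ip for ip, _ua in requests)
    return sorted(ip for ip, n in counts.items() if n > per_ip_limit)


# 1,000 requests from one IP with one user agent.
single_source = [("198.51.100.7", "bot/1.0")] * 1000

# The same 1,000 requests spread over 200 IPs and 50 user agents.
distributed = [
    (f"203.0.113.{i % 200}", f"agent-{i % 50}") for i in range(1000)
]

print(flag_by_ip(single_source, per_ip_limit=100))  # the single IP is flagged
print(flag_by_ip(distributed, per_ip_limit=100))    # nothing is flagged
```

In the distributed case each IP sends only five requests, so a per-IP threshold never trips even though the total volume is identical.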

Jai Vijayan is a seasoned technology reporter with over 20 years of experience in IT trade journalism. He was most recently a Senior Editor at Computerworld, where he covered information security and data privacy issues for the publication.
 
