5/21/2020 04:55 PM
Web Scrapers Have Bigger-Than-Perceived Impact on Digital Businesses

The economic impact of bot traffic can be unexpectedly substantial, a PerimeterX-commissioned study finds.

Automated bots that collect content, product descriptions, pricing, inventory data, and other public-facing information from websites have a greater economic and performance impact than many organizations might realize, a new study suggests.

Bot mitigation company PerimeterX recently commissioned market intelligence firm Aberdeen Group to look into how web-scraping bots might be affecting the revenues of digital businesses.

The study found bots account for between 40% and 60% of total website traffic in certain industries and can impact businesses in multiple ways, including overloading their infrastructure, skewing analytics data, and diminishing the value of their IP, marketing, and SEO investments. The impact on revenue from such factors is considerable, according to PerimeterX.

"Web scraping hurts your revenue in more ways than you know," says Deepak Patel, security evangelist at PerimeterX. For the e-commerce sector, website scraping can dilute overall annual website profitability by as much as 80%, the study shows.

"For the media sector, the median annual business impact of website scraping is as much as 27% of overall website profitability," Patel adds.

Many organizations don't view web-scraping bots as a security threat because they don't breach the network or exploit a security flaw. However, they pose a significant threat to the business logic and proprietary content that are essential to maintaining a competitive edge.

"Malicious web-scraping bots can steal your exclusive, copyrighted content and images," says Patel, adding that it can also damage a site's SEO rankings when search engines detect pages with duplicate content.

Organizations routinely use web scrapers to look up information on their competition, to build services based on third-party data, or for a variety of other reasons. The bots scour websites — in much the same way search engine crawlers do — and collect any publicly posted information that might be useful to the organization running them.
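As a rough illustration of how little machinery such a bot needs, the Python sketch below fetches a public catalog page and parses out product names and prices. The URL and HTML structure are assumptions made up for the example; real scraping services layer scheduling, proxy rotation, and per-site parsing rules on top of this.

```python
# Minimal illustration of how a scraping bot collects publicly posted data.
# The URL and the HTML structure assumed here are hypothetical.
import requests
from bs4 import BeautifulSoup

def scrape_products(url: str) -> list[dict]:
    """Fetch a public catalog page and pull out name/price pairs."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    products = []
    # Assumes each product sits in a <div class="product"> with child
    # elements for name and price -- purely illustrative markup.
    for item in soup.select("div.product"):
        products.append({
            "name": item.select_one(".product-name").get_text(strip=True),
            "price": item.select_one(".product-price").get_text(strip=True),
        })
    return products

if __name__ == "__main__":
    for product in scrape_products("https://www.example.com/catalog"):
        print(product)
```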

Though there are some questions over the legality of the practice, numerous products and services are available that allow organizations to scrape another firm's website for information that is available publicly. In a lawsuit involving talent management advisory firm hiQ Labs and LinkedIn, the Ninth Circuit Court of Appeals last year held that the scraping of publicly available data does not violate US computer fraud laws. LinkedIn had wanted hiQ to stop scraping publicly available data from its site, which the latter was using to create analytics tools to help companies deal with employee retention issues.

"As a technical matter, web scraping is simply machine-automated web browsing and accesses and records the same information, which a human visitor to the site might do manually," the Electronic Frontier Foundation had noted in welcoming the appellate court's decision.

Bad Bots
The study shows that while humans and "good bots" — such as those used by search engines — represented a substantial proportion of web traffic, "bad bots" represented a significant proportion as well. Nearly 17% of all traffic on e-commerce websites, for example, came from bad bots. On travel sites, the proportion was closer to 31%, and on media sites it was around 9.5%.

Patel says bad bots are bots that crawl websites to perform abusive or malicious actions, including account takeover and content plagiarism. Such bots often mimic human behavior and use multiple IPs to evade detection.

They also can scrape content that other sites might have invested in substantially to develop — like SEO-optimized product descriptions or marketing content, for instance. For the companies doing the scraping, such content can reduce or even eliminate the need to develop their own. Conversely, for the digital businesses being targeted, web scraping can erode the value of those investments, the study found. Similarly, information that companies need to put on their sites — like pricing or product availability — can give rivals valuable insight for making their own decisions.

Bot traffic can also overload web infrastructure by sending millions of requests to specific paths, such as login or checkout pages, causing a slowdown for users, Patel says. According to him, 80% of account logins originate from bad bots.

"Scraping bots can significantly impact website performance since they have to collect a lot of data quickly," Patel says. On retail sites, for example, the traffic from bots trying to keep pace with new product listings or pricing changes can degrade performance.

Many tools are commercially available that are designed to help digital businesses deal with web scrapers.
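Such products typically combine device fingerprinting, behavioral analysis, and threat intelligence. For a sense of the most basic building block involved, the sketch below applies a per-IP rate limit inside a hypothetical Flask application; the window and request limit are arbitrary, and, as Patel notes below, bots distributed across many IPs will slip past something this simple.

```python
# Crude per-IP rate limit -- a minimal illustration of one building block
# that real bot-mitigation tools extend with fingerprinting and behavior.
import time
from collections import defaultdict, deque

from flask import Flask, abort, request

app = Flask(__name__)
WINDOW_SECONDS = 60
MAX_REQUESTS = 120  # per IP per window; an arbitrary illustrative limit
_recent = defaultdict(deque)

@app.before_request
def throttle():
    now = time.time()
    ip = request.remote_addr or "unknown"
    q = _recent[ip]
    # Drop timestamps that have aged out of the window.
    while q and now - q[0] > WINDOW_SECONDS:
        q.popleft()
    q.append(now)
    if len(q) > MAX_REQUESTS:
        abort(429)  # Too Many Requests

@app.route("/catalog")
def catalog():
    return "public product listings"
```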

"But today's bots, unlike more crude, basic bots of the past, are becoming more adept at mimicking actual users and disguising their true purpose," Patel says. "Hyper-distributed scraping attacks, achieved by using many different user agents, IPs, and [autonomous system numbers] are even more dangerous, resulting in higher volume and higher difficulty of detection."

Jai Vijayan is a seasoned technology reporter with over 20 years of experience in IT trade journalism. He was most recently a Senior Editor at Computerworld, where he covered information security and data privacy issues for the publication. Over the course of his 20-year ...