Dark Reading is part of the Informa Tech Division of Informa PLC

10:00 AM
Rob Simon

Defending Against Web Scraping Attacks

Web scraping attacks, like Facebook's recent data leak, can easily lead to more significant breaches.

Web scraping is as old as the Internet, but it's a threat that rarely gets its due. Companies frequently underestimate its risk potential because it is technically not a "hack" or "breach." 

A recent example is Facebook, which has tried to downplay its latest massive data leak by claiming the scraping impacted public information only. The company overlooks the risk this type of personal data exposure poses for the victims and the ultimate value of harvesting this data on such a massive scale, particularly for social engineering attacks.  

Scraping sites for user data is nothing new; Facebook has faced this issue on multiple occasions. In 2013, I disclosed two methods of scraping Facebook user data. One involved a tool I created called Facebook Harvester, which utilized the then-recently released Graph Search feature to perform a brute-force search of phone numbers and return any associated user profile. 

Meanwhile, Facebook is still partially vulnerable to malicious scraping through its password-reset page. By entering a phone number, it is possible to pull up privately listed people on the platform, including their full name and profile photo. This differs notably from the method used in the recent data dump in that the end user does not need to be publicly searchable. While Facebook has tightened the data revealed on this page, it may still prove a useful tool for malicious actors.

But scraping isn't just a social media problem. It's an issue that affects many types of organizations across various industries. Scraping is one of the methods malicious hackers use to collect intel on companies before they target them with more significant attacks. 

Here is a closer look at this undervalued threat. 

How Attackers Use Web Scraping
Web scraping can easily lead to more significant attacks. At my company, we routinely use Web scraping as one of the initial steps in a red team or phishing engagement. By pulling the metadata from posted documents, we can find employee names and usernames and deduce username and email formats, which is particularly helpful when the username format would otherwise be difficult to guess. Combine this with a list of current employees scraped from sites like LinkedIn, and an adversary can perform targeted phishing and credential brute-force attacks.

In one recent example, we determined the client's unique username configuration by collecting documents scraped from the company's public-facing sites. These documents contained the author's first and last name and the file path; because the file was saved within the user's profile path, the path also contained the username. In this case, the format was two letters of the first name, the last name, and a digit. So, if the user's name were John Smith, the username would have been josmith1. Once we found this, it was easy enough to perform credential brute-forcing by using a list of common first and last names to match the discovered username format. By running the attack with just a few common passwords per username, we gained access to at least one account, which gave our red team an initial foothold. 
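To make the leak concrete, here is a minimal Python sketch of how a saved file path gives away a username and how the format in the example maps a name to that username. The Windows-style profile path, the regular expression, and both function names are assumptions for illustration; they are not the tooling used on the engagement.

```python
import re
from typing import Optional

# Assumed layout: a Windows profile path such as C:\Users\<username>\...
PROFILE_PATH = re.compile(r"^[A-Za-z]:\\Users\\([^\\]+)\\")


def username_from_path(path: str) -> Optional[str]:
    """Return the profile-folder name embedded in a saved file path, if any."""
    match = PROFILE_PATH.match(path)
    return match.group(1) if match else None


def format_username(first: str, last: str, digit: int) -> str:
    """Apply the format from the example: two letters of the first name,
    the last name, and a digit, all lowercase."""
    return f"{first[:2]}{last}{digit}".lower()


# A path recovered from document metadata reveals the account name,
# and the name-to-username rule reproduces it.
leaked = username_from_path(r"C:\Users\josmith1\Documents\report.docx")
assert leaked == format_username("John", "Smith", 1)
```

The same check is useful defensively: running it against your own published documents shows what an outsider can learn before you publish anything else.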

Scraping document metadata is also useful for detecting internal hostnames and software versions in use at the targeted company. This enables an attacker to customize the attack to exploit vulnerabilities specific to that company, and it is an important part of victim reconnaissance.  

Adversaries can also use scraping to collect gated information from a website if that information isn't properly protected. Take Facebook's password-reset page: Anyone can find privately listed people through a simple query with a phone number. While a password-reset page may be necessary, does it really need to confirm or, worse, return a user's private information? 

While this may be a worst-case scenario, many websites are still vulnerable to user enumeration via simple error messages. I often see a registration, login, or password-reset page return a message like "the username could not be found" when someone submits invalid credentials or requests a password reset. While this seems innocent enough, attackers can abuse this message to determine which usernames or emails exist as registered accounts for the service. A list of valid usernames can be used for more targeted credential brute-force attacks, and valid emails can be used in targeted phishing attacks.

Controlling the Threat
There are several ways to reduce the risk of Web scraping.

First, organizations should regularly audit their websites to make sure they are not unintentionally exposing sensitive information on public-facing sites, whether through published documents or through information stored in back-end databases that the website links to.

Organizations should also have a process in place to strip metadata from documents before they are published externally. They should avoid exposing details such as usernames, file paths, print queues, and software versions, as these can all be useful in mounting an attack.
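As one hedged illustration of such a process, the Python sketch below blanks the author fields in a .docx file before publication. It relies on the fact that a .docx is a ZIP archive whose core properties live in docProps/core.xml; the function name and the choice of fields are assumptions, and the sketch does not touch other places metadata can hide, such as docProps/app.xml, comments, or tracked changes.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

CP = "http://schemas.openxmlformats.org/package/2006/metadata/core-properties"
DC = "http://purl.org/dc/elements/1.1/"
# Register the usual core.xml prefixes so the rewritten file keeps them.
for prefix, uri in (
    ("cp", CP),
    ("dc", DC),
    ("dcterms", "http://purl.org/dc/terms/"),
    ("dcmitype", "http://purl.org/dc/dcmitype/"),
    ("xsi", "http://www.w3.org/2001/XMLSchema-instance"),
):
    ET.register_namespace(prefix, uri)

# Assumed set of fields to blank: the author and the last editor.
FIELDS = (f"{{{DC}}}creator", f"{{{CP}}}lastModifiedBy")


def strip_core_properties(docx_bytes: bytes) -> bytes:
    """Return a copy of the .docx with author fields in core.xml blanked."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "docProps/core.xml":
                root = ET.fromstring(data)
                for tag in FIELDS:
                    for node in root.iter(tag):
                        node.text = ""
                data = ET.tostring(root, encoding="utf-8", xml_declaration=True)
            # Every other archive member is copied through unchanged.
            dst.writestr(item, data)
    return out.getvalue()
```

A step like this fits naturally into a publishing pipeline, so that stripping does not depend on each author remembering to do it.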

Password-reset pages often contain verbose messages that reveal whether a submitted username is valid. Going back to the Facebook example, should the password-reset page return the full name and profile picture associated with a phone number before sending a reset link? In these instances, the password-reset page reveals unnecessary information. Where possible, pages should return a generic message after a person submits information for a password reset, letting them know a text or email will be sent to the account if it exists. The key is that the page should not indicate whether the account or information is valid.
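A minimal Python sketch of that behavior follows. The function name, the in-memory account store, and the outbox list standing in for a mail or SMS service are all hypothetical; the point is only that the caller receives the same message whether or not the account exists.

```python
from typing import Dict, List

GENERIC_MESSAGE = (
    "If an account matches the information you entered, "
    "a reset link will be sent to it."
)


def request_password_reset(
    identifier: str,
    accounts: Dict[str, Dict[str, str]],
    outbox: List[str],
) -> str:
    """Queue a reset for a known account and return the same message either way."""
    account = accounts.get(identifier.strip().lower())
    if account is not None:
        # Stand-in for sending the reset link by email or text.
        outbox.append(account["email"])
    # No name, photo, or "not found" hint is returned to the caller.
    return GENERIC_MESSAGE


accounts = {"josmith1": {"email": "john.smith@example.com"}}
outbox: List[str] = []
known = request_password_reset("josmith1", accounts, outbox)
unknown = request_password_reset("nobody", accounts, outbox)
assert known == unknown
```

An identical message is necessary but may not be sufficient: if the valid-account path takes noticeably longer to respond, the timing difference can still reveal which accounts exist.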

Rate limiting and CAPTCHAs are standard defenses against scraping, but a determined attacker may still be able to bypass them by using CAPTCHA-solving services or rotating through a list of IP addresses. These measures make Web scraping more difficult but are not a substitute for the proper protection of sensitive data.
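For readers who have not implemented rate limiting, here is one possible sketch in Python: a sliding-window limiter that allows a fixed number of requests per client within a time window. The class name, the limits, and keying on a client IP address are assumptions for illustration. As noted above, an attacker who rotates IP addresses gets a fresh allowance for each one.

```python
import time
from collections import defaultdict, deque
from typing import Deque, Dict, Optional


class SlidingWindowLimiter:
    """Allow at most max_requests per client within window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, client_key: str, now: Optional[float] = None) -> bool:
        """Record the request and return True, or return False if over the limit."""
        now = time.monotonic() if now is None else now
        hits = self._hits[client_key]
        # Discard timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


# Three requests per 60 seconds; the fourth within the window is refused.
limiter = SlidingWindowLimiter(max_requests=3, window_seconds=60)
results = [limiter.allow("203.0.113.5", now=t) for t in (0, 1, 2, 3)]
assert results == [True, True, True, False]
```

The optional now argument exists only so the behavior can be checked without waiting; in a real service the limiter would read the clock itself and would likely key on more than the IP address.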

Recognize the Threat
While Web scraping has long been viewed more as an annoyance than a security risk, it is widely used by attackers to gain critical insights into a company, particularly for user enumeration attacks. Implementing some of these security measures can greatly reduce a company's risk.

Rob Simon is a Principal Security Consultant at TrustedSec, where he specializes in Web and mobile applications, as well as hardware security. Rob has more than a decade of experience in information security, with roles ranging from software development to penetration ...
