Dark Reading is part of the Informa Tech Division of Informa PLC

Rob Simon

Defending Against Web Scraping Attacks

Web scraping attacks, like Facebook's recent data leak, can easily lead to more significant breaches.

Web scraping is as old as the Internet, but it's a threat that rarely gets its due. Companies frequently underestimate its risk potential because it is technically not a "hack" or "breach." 

A recent example is Facebook, which tried to downplay its latest massive data leak by claiming the scraping affected only public information. This stance overlooks the risk this kind of personal data exposure poses to the victims, as well as the value of harvesting such data at massive scale, particularly for social engineering attacks.


Scraping sites for user data is nothing new; Facebook has faced this issue on multiple occasions. In 2013, I disclosed two methods of scraping Facebook user data. One involved a tool I created called Facebook Harvester, which utilized the then-recently released Graph Search feature to perform a brute-force search of phone numbers and return any associated user profile. 

Meanwhile, Facebook is still partially vulnerable to malicious scraping through its password-reset page. By entering a phone number, it is possible to pull up privately listed people on the platform — including their full name and profile photo. This is notably different from the method used in the recent data dump, in that the end user does not need to be publicly searchable. While Facebook has tightened the data revealed on this page, it may still prove a useful tool for malicious actors. 

But scraping isn't just a social media problem. It's an issue that affects many types of organizations across various industries. Scraping is one of the methods malicious hackers use to collect intel on companies before they target them with more significant attacks. 

Here is a closer look at this undervalued threat. 

How Attackers Use Web Scraping
Web scraping can easily lead to more significant attacks. At my company, we routinely use Web scraping as one of the initial steps in a red team or phishing engagement. By pulling metadata from posted documents, we can find employee names and usernames and deduce username and email formats, which is particularly helpful when the format would otherwise be difficult to guess. Combine this with scraping a list of current employees from sites like LinkedIn, and an adversary can perform targeted phishing and credential brute-force attacks.

In one recent example, we determined the client's unique username configuration by collecting documents scraped from the company's public-facing sites. These documents contained the author's first and last name and the file path; because the file was saved within the user's profile path, the path also contained the username. In this case, the format was two letters of the first name, the last name, and a digit. So, if the user's name were John Smith, the username would have been josmith1. Once we found this, it was easy enough to perform credential brute-forcing by using a list of common first and last names to match the discovered username format. By running the attack with just a few common passwords per username, we gained access to at least one account, which gave our red team an initial foothold. 
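As a rough sketch of this technique, candidate credentials for the format described above could be generated like so. The name and password lists here are invented for illustration; a real engagement would use much larger, targeted lists.

```python
def make_username(first, last, digit=1):
    """First two letters of the first name, the full last name, and a digit."""
    return f"{first[:2].lower()}{last.lower()}{digit}"

# Hypothetical seed lists, purely for illustration.
common_names = [("John", "Smith"), ("Mary", "Jones")]
common_passwords = ["Winter2021!", "Password1"]

# Pair each generated username with a few common passwords for a
# low-and-slow credential brute-force attempt.
candidates = [
    (make_username(first, last), password)
    for first, last in common_names
    for password in common_passwords
]
```

Keeping the password list short per username is what lets this kind of spray stay under account-lockout thresholds.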

Scraping document metadata is also useful for detecting internal hostnames and software versions in use at the targeted company. This enables an attacker to customize the attack to exploit vulnerabilities specific to that company, and it is an important part of victim reconnaissance.  

Adversaries can also use scraping to collect gated information from a website if that information isn't properly protected. Take Facebook's password-reset page: Anyone can find privately listed people through a simple query with a phone number. While a password-reset page may be necessary, does it really need to confirm or, worse, return a user's private information? 

While this may be a worst-case scenario, many websites are still vulnerable to user enumeration via simple error messages. I often see registration, login, or password-reset pages that return a message like "the username could not be found" in response to invalid credentials or an unknown account. While this seems innocent enough, attackers can abuse these notifications to determine which usernames or email addresses are registered with the service. A list of valid usernames can feed more targeted credential brute-force attacks, and valid email addresses can be used in targeted phishing campaigns.
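A defender auditing their own pages for this issue might scan responses for tell-tale wording. A minimal sketch, with the phrase list purely illustrative:

```python
# Phrases that confirm or deny an account's existence; illustrative only.
LEAKY_PHRASES = (
    "could not be found",
    "no account",
    "invalid username",
)

def reveals_account_existence(response_text):
    """Flag error messages that confirm whether a username exists."""
    text = response_text.lower()
    return any(phrase in text for phrase in LEAKY_PHRASES)
```

Note that response wording is only one channel; differing status codes or response times can leak the same information.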

Controlling the Threat
There are several ways to reduce the risk of Web scraping.

First, organizations should regularly audit their websites to make sure they are not unintentionally exposing sensitive information on public-facing pages, whether through published documents or through back-end databases linked to the site.

Organizations should also have a process in place to strip metadata from documents before they are published externally. They should prevent exposing things such as usernames, file paths, print queues, and software versions, as these can all be useful in mounting an attack. 
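Because .docx files are ZIP archives with core properties stored in docProps/core.xml, author fields can be blanked before publishing. A simplified sketch using only the standard library; real pipelines would use a dedicated metadata-scrubbing tool, and the archive built below is a toy stand-in for a real document:

```python
import io
import re
import zipfile

# Minimal core-properties XML of the kind found in a .docx; sample values only.
CORE_XML = (
    '<?xml version="1.0"?>'
    '<cp:coreProperties'
    ' xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties"'
    ' xmlns:dc="http://purl.org/dc/elements/1.1/">'
    '<dc:creator>josmith1</dc:creator>'
    '<cp:lastModifiedBy>josmith1</cp:lastModifiedBy>'
    '</cp:coreProperties>'
)

def strip_core_props(docx_bytes):
    """Rewrite the archive with author-related core properties blanked."""
    src = zipfile.ZipFile(io.BytesIO(docx_bytes))
    out_buf = io.BytesIO()
    with zipfile.ZipFile(out_buf, "w") as out:
        for name in src.namelist():
            data = src.read(name)
            if name == "docProps/core.xml":
                for tag in (b"dc:creator", b"cp:lastModifiedBy"):
                    data = re.sub(
                        rb"(<" + tag + rb">).*?(</" + tag + rb">)",
                        rb"\1\2",
                        data,
                    )
            out.writestr(name, data)
    return out_buf.getvalue()

# Build a toy archive standing in for a real .docx, then scrub it.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("docProps/core.xml", CORE_XML)

cleaned = strip_core_props(buf.getvalue())
cleaned_core = zipfile.ZipFile(io.BytesIO(cleaned)).read("docProps/core.xml")
```

The same idea applies to PDF and image formats, each of which carries its own metadata fields and needs its own scrubbing step.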

Password-reset pages often contain verbose messages that reveal if a submitted username is valid or not. Going back to the Facebook example, should the password-reset page return the full name and profile picture associated with a phone number before sending a reset link? In these instances, the password-reset page reveals unnecessary information. Where possible, pages should return a generic message after a person submits information for a password reset, letting them know a text or email will be sent to the account if it exists. The key is that the page should not indicate whether the account or information is valid.  
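A minimal sketch of such a handler, with a hypothetical in-memory user store; the key property is that the response is identical whether or not the account exists:

```python
GENERIC_MSG = (
    "If a matching account exists, a reset link will be sent "
    "to the email address on file."
)

# Hypothetical user store for illustration only.
USERS = {"josmith1": "jsmith@example.com"}

def request_password_reset(identifier):
    """Always return the same message; act only out of band."""
    email = USERS.get(identifier)
    if email is not None:
        pass  # queue the reset email here, asynchronously
    return GENERIC_MSG
```

In a real implementation the reset email should also be queued asynchronously, so that response timing does not leak account existence either.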

Rate limiting and CAPTCHAs are standard defenses against scraping, but a determined attacker can often bypass them with CAPTCHA-solving services or by rotating through a list of IP addresses. These measures make Web scraping more difficult, but they are not a substitute for properly protecting sensitive data.
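For reference, a basic per-client sliding-window rate limiter can be sketched in a few lines. The thresholds are illustrative; a production deployment would keep this state in shared storage and choose its client key carefully, precisely because source IPs rotate.

```python
import time
from collections import deque

class RateLimiter:
    """Allow at most max_requests per client within a sliding time window."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window = window_seconds
        self.hits = {}  # client_id -> deque of request timestamps

    def allow(self, client_id, now=None):
        now = time.monotonic() if now is None else now
        q = self.hits.setdefault(client_id, deque())
        while q and now - q[0] > self.window:  # drop hits outside the window
            q.popleft()
        if len(q) >= self.max_requests:
            return False
        q.append(now)
        return True

limiter = RateLimiter(max_requests=3, window_seconds=60)
# Four requests from one client inside the window: the fourth is refused.
results = [limiter.allow("10.0.0.1", now=t) for t in (0, 1, 2, 3)]
```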

Recognize the Threat
While Web scraping has long been viewed more as an annoyance than a security risk, it is widely used by attackers to gain critical insights into a company, particularly for user enumeration attacks. Implementing some of these security measures can greatly reduce a company's risk.

Rob Simon is a Principal Security Consultant at TrustedSec, where he specializes in Web and mobile applications, as well as hardware security. Rob has more than a decade of experience in information security, with roles ranging from software development to penetration ... View Full Bio