Dark Reading is part of the Informa Tech Division of Informa PLC



Rob Simon

Defending Against Web Scraping Attacks

Web scraping attacks, like Facebook's recent data leak, can easily lead to more significant breaches.

Web scraping is as old as the Internet, but it's a threat that rarely gets its due. Companies frequently underestimate its risk potential because it is technically not a "hack" or "breach." 

A recent example is Facebook, which has tried to downplay its latest massive data leak by claiming the scraping affected public information only. This position overlooks the risk that exposing personal data poses to the victims, and the value of harvesting that data at such a massive scale, particularly for social engineering attacks.


Scraping sites for user data is nothing new; Facebook has faced this issue on multiple occasions. In 2013, I disclosed two methods of scraping Facebook user data. One involved a tool I created called Facebook Harvester, which utilized the then-recently released Graph Search feature to perform a brute-force search of phone numbers and return any associated user profile. 

Meanwhile, Facebook is still partially vulnerable to malicious scraping through its password-reset page. By entering a phone number, it is possible to pull up privately listed people on the platform — including their full name and profile photo. This is notably different from the method used in the recent data dump, in that the end user does not need to be publicly searchable. While Facebook has tightened the data revealed on this page, it may still prove a useful tool for malicious actors. 

But scraping isn't just a social media problem. It's an issue that affects many types of organizations across various industries. Scraping is one of the methods malicious hackers use to collect intel on companies before they target them with more significant attacks. 

Here is a closer look at this undervalued threat. 

How Attackers Use Web Scraping
Web scraping can easily lead to more significant attacks. At my company, we routinely use Web scraping as one of the initial steps in a red team or phishing engagement. By pulling the metadata from posted documents, we can find employee names, usernames, and deduce username and email formats, which is particularly helpful when the username format would otherwise be difficult to guess. Mix this with scraping a list of current employees from sites like LinkedIn, and an adversary can perform targeted phishing and credential brute-force attacks. 
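As a minimal sketch of that first step, the snippet below turns scraped employee names into candidate email addresses. The names and the `first.last@example.com` format are hypothetical illustrations, not taken from any real engagement:

```python
# Sketch: deducing candidate emails from scraped employee names.
# The names and the "first.last" format are hypothetical assumptions.

def candidate_emails(names, domain="example.com"):
    """Turn 'First Last' strings into first.last@domain guesses."""
    emails = []
    for name in names:
        parts = name.lower().split()
        if len(parts) >= 2:
            emails.append(f"{parts[0]}.{parts[-1]}@{domain}")
    return emails

scraped = ["John Smith", "Jane Doe"]
print(candidate_emails(scraped))
# ['john.smith@example.com', 'jane.doe@example.com']
```

In practice, an attacker would verify the guessed format against a few known-good addresses before scaling it up across a full employee list.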

In one recent example, we determined the client's unique username configuration by collecting documents scraped from the company's public-facing sites. These documents contained the author's first and last name and the file path; because the file was saved within the user's profile path, the path also contained the username. In this case, the format was two letters of the first name, the last name, and a digit. So, if the user's name were John Smith, the username would have been josmith1. Once we found this, it was easy enough to perform credential brute-forcing by using a list of common first and last names to match the discovered username format. By running the attack with just a few common passwords per username, we gained access to at least one account, which gave our red team an initial foothold. 
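The technique reads roughly like the sketch below: recover a username from a file path found in document metadata, then generate candidate usernames in the same format from common names. The path and names here are illustrative placeholders:

```python
import re

# Sketch: recover a username from a Windows profile path embedded in
# document metadata, then generate candidates in the same format.
# The path and names are illustrative, not from a real engagement.

def username_from_path(path):
    """Pull the account name out of a Windows profile path."""
    m = re.search(r"[\\/]Users[\\/]([^\\/]+)", path)
    return m.group(1) if m else None

def candidates(first, last, digits=range(1, 3)):
    """Two letters of the first name + last name + a digit."""
    return [f"{first[:2].lower()}{last.lower()}{d}" for d in digits]

print(username_from_path(r"C:\Users\josmith1\Documents\report.docx"))
# josmith1
print(candidates("John", "Smith"))
# ['josmith1', 'josmith2']
```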

Scraping document metadata is also useful for detecting internal hostnames and software versions in use at the targeted company. This enables an attacker to customize the attack to exploit vulnerabilities specific to that company, and it is an important part of victim reconnaissance.  

Adversaries can also use scraping to collect gated information from a website if that information isn't properly protected. Take Facebook's password-reset page: Anyone can find privately listed people through a simple query with a phone number. While a password-reset page may be necessary, does it really need to confirm or, worse, return a user's private information? 

While this may be a worst-case scenario, many websites are still vulnerable to user enumeration via simple error messages. I often see registration, login, or password-reset pages return a message like "the username could not be found" when invalid credentials are submitted. While this seems innocent enough, attackers can abuse the notification to determine which usernames or emails exist as registered accounts for the service. A list of valid usernames can feed more targeted credential brute-force attacks, and valid emails can feed targeted phishing attacks.
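The enumeration logic is trivially simple, which is part of the problem. In this toy sketch, `fake_login` stands in for a real site whose error messages differ, and "alice" is the only registered user; the attacker keeps every username whose response differs from the "not found" message:

```python
# Sketch: how differing error messages enable user enumeration.
# fake_login is a stand-in for a real site; names are illustrative.

def fake_login(username, password):
    registered = {"alice"}
    if username not in registered:
        return "The username could not be found"
    return "Incorrect password"

def enumerate_users(candidates, login):
    """Keep usernames whose error differs from 'not found'."""
    return [u for u in candidates
            if login(u, "wrong-password") != "The username could not be found"]

print(enumerate_users(["alice", "bob", "carol"], fake_login))
# ['alice']
```

Note that even with identical message text, measurable differences in response time or status code can leak the same signal.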

Controlling the Threat
There are several ways to reduce the risk of Web scraping.

First, organizations should regularly audit their websites to make sure they are not unintentionally exposing sensitive information on public-facing pages, whether through published documents or through information stored in back-end databases linked to the site.
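One piece of such an audit can be automated: listing every document linked from a page so each file can be reviewed before it leaks metadata. The sketch below uses only the standard library; the HTML snippet is illustrative, and a real audit would fetch pages with `urllib` and recurse through the site:

```python
import re
from html.parser import HTMLParser

# Sketch of an audit helper: find links to published documents on a
# page so they can be reviewed for sensitive content or metadata.

DOC_RE = re.compile(r"\.(pdf|docx?|xlsx?|pptx?)$", re.I)

class DocLinkFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href", "")
            if DOC_RE.search(href):
                self.links.append(href)

page = '<a href="/reports/q3.pdf">Q3</a> <a href="/about">About</a>'
finder = DocLinkFinder()
finder.feed(page)
print(finder.links)
# ['/reports/q3.pdf']
```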

Organizations should also have a process in place to strip metadata from documents before they are published externally. This prevents exposing details such as usernames, file paths, print queues, and software versions, all of which can be useful in mounting an attack.
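For Office documents, this can be scripted: a `.docx` file is a zip archive whose `docProps/core.xml` part holds author metadata. The sketch below blanks the creator field before publication; the in-memory toy archive stands in for a real document, and dedicated tools handle the many other metadata locations a real file can contain:

```python
import io
import re
import zipfile

# Sketch: blank the <dc:creator> field in a .docx before publishing.
# A minimal in-memory archive stands in for a real document.

def strip_docx_author(data: bytes) -> bytes:
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(data)) as src, \
         zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            content = src.read(item)
            if item.filename == "docProps/core.xml":
                content = re.sub(rb"(<dc:creator>)[^<]*(</dc:creator>)",
                                 rb"\1\2", content)
            dst.writestr(item, content)
    return out.getvalue()

# Build a toy "docx" containing only the metadata part.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("docProps/core.xml",
               "<cp:coreProperties><dc:creator>John Smith</dc:creator>"
               "</cp:coreProperties>")
cleaned = strip_docx_author(buf.getvalue())
with zipfile.ZipFile(io.BytesIO(cleaned)) as z:
    print(z.read("docProps/core.xml").decode())
# <cp:coreProperties><dc:creator></dc:creator></cp:coreProperties>
```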

Password-reset pages often contain verbose messages that reveal if a submitted username is valid or not. Going back to the Facebook example, should the password-reset page return the full name and profile picture associated with a phone number before sending a reset link? In these instances, the password-reset page reveals unnecessary information. Where possible, pages should return a generic message after a person submits information for a password reset, letting them know a text or email will be sent to the account if it exists. The key is that the page should not indicate whether the account or information is valid.  
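The non-leaking behavior amounts to returning one constant message regardless of the lookup result, as in this sketch (the user store and `send_reset_email` stub are illustrative):

```python
# Sketch: a reset handler that returns the same generic message whether
# or not the account exists, so the response leaks nothing.

USERS = {"alice@example.com"}  # illustrative user store

def send_reset_email(address):
    pass  # a real handler would queue a reset link here

def handle_reset(address):
    if address in USERS:
        send_reset_email(address)
    # Same response either way -- never confirm the account exists.
    return ("If an account exists for that address, "
            "a reset link has been sent.")

print(handle_reset("alice@example.com") == handle_reset("nobody@example.com"))
# True
```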

Rate limiting and CAPTCHAs are standard defenses against scraping, but a determined attacker may still bypass them by using CAPTCHA-solving services or rotating through a list of IP addresses. These measures make Web scraping more difficult, but they are not a substitute for properly protecting sensitive data.
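A per-IP sliding-window limiter, sketched below, illustrates both the defense and its limit: the threshold and window are arbitrary, and an attacker rotating IPs still gets the full allowance per address:

```python
import time
from collections import defaultdict, deque

# Sketch: per-IP sliding-window rate limiting. Limits are arbitrary
# illustrations; IP rotation still yields `limit` requests per address.

class RateLimiter:
    def __init__(self, limit=5, window=60.0):
        self.limit = limit
        self.window = window
        self.hits = defaultdict(deque)

    def allow(self, ip, now=None):
        now = time.monotonic() if now is None else now
        q = self.hits[ip]
        while q and now - q[0] > self.window:
            q.popleft()  # drop requests that fell out of the window
        if len(q) >= self.limit:
            return False
        q.append(now)
        return True

rl = RateLimiter(limit=2, window=60.0)
print([rl.allow("10.0.0.1", now=t) for t in (0, 1, 2)])
# [True, True, False]
```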

Recognize the Threat
While Web scraping has long been viewed more as an annoyance than a security risk, it is widely used by attackers to gain critical insights into a company, particularly for user enumeration attacks. Implementing some of these security measures can greatly reduce a company's risk.

Rob Simon is a Principal Security Consultant at TrustedSec, where he specializes in Web and mobile applications, as well as hardware security. Rob has more than a decade of experience in information security, with roles ranging from software development to penetration ...