Dark Reading | Endpoint | Commentary
5/7/2021 10:00 AM
Rob Simon
Defending Against Web Scraping Attacks

Web scraping attacks, like Facebook's recent data leak, can easily lead to more significant breaches.

Web scraping is as old as the Internet, but it's a threat that rarely gets its due. Companies frequently underestimate its risk potential because it is technically not a "hack" or "breach." 

A recent example is Facebook, which has tried to downplay its latest massive data leak by claiming the scraping impacted public information only. The company overlooks the risk this type of personal data exposure poses for the victims and the ultimate value of harvesting this data on such a massive scale, particularly for social engineering attacks.  


Scraping sites for user data is nothing new; Facebook has faced this issue on multiple occasions. In 2013, I disclosed two methods of scraping Facebook user data. One involved a tool I created called Facebook Harvester, which utilized the then-recently released Graph Search feature to perform a brute-force search of phone numbers and return any associated user profile. 

Meanwhile, Facebook is still partially vulnerable to malicious scraping through its password-reset page. By entering a phone number, it is possible to pull up privately listed people on the platform — including their full name and profile photo. This is notably different from the method used in the recent data dump, in that the end user does not need to be publicly searchable. While Facebook has tightened the data revealed on this page, it may still prove a useful tool for malicious actors. 

But scraping isn't just a social media problem. It's an issue that affects many types of organizations across various industries. Scraping is one of the methods malicious hackers use to collect intel on companies before they target them with more significant attacks. 

Here is a closer look at this undervalued threat. 

How Attackers Use Web Scraping
Web scraping can easily lead to more significant attacks. At my company, we routinely use Web scraping as one of the initial steps in a red team or phishing engagement. By pulling the metadata from posted documents, we can find employee names and usernames and deduce username and email formats, which is particularly helpful when the username format would otherwise be difficult to guess. Mix this with scraping a list of current employees from sites like LinkedIn, and an adversary can perform targeted phishing and credential brute-force attacks.
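As a sketch of that metadata step, the following Python pulls author names out of the `docProps/core.xml` part of a published .docx file and expands them into candidate email addresses. The domain and address patterns are illustrative, not from any specific engagement.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# Namespaces used by the core-properties part of an Office Open XML file
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def docx_authors(path):
    """Return author names stored in a .docx file's docProps/core.xml."""
    with zipfile.ZipFile(path) as z:
        core = ET.fromstring(z.read("docProps/core.xml"))
    names = []
    for tag in ("dc:creator", "cp:lastModifiedBy"):
        el = core.find(tag, NS)
        if el is not None and el.text:
            names.append(el.text.strip())
    return names

def guess_email_format(names, domain):
    """Turn 'John Smith'-style author names into candidate addresses
    using a few common corporate email patterns."""
    guesses = set()
    for name in names:
        parts = re.split(r"\s+", name.strip().lower())
        if len(parts) < 2:
            continue
        first, last = parts[0], parts[-1]
        guesses.update({
            f"{first}.{last}@{domain}",
            f"{first[0]}{last}@{domain}",
            f"{first}{last[0]}@{domain}",
        })
    return sorted(guesses)
```

Running `guess_email_format` over the names harvested from a handful of documents quickly narrows down which format a company actually uses.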

In one recent example, we determined the client's unique username configuration by collecting documents scraped from the company's public-facing sites. These documents contained the author's first and last name and the file path; because the file was saved within the user's profile path, the path also contained the username. In this case, the format was two letters of the first name, the last name, and a digit. So, if the user's name were John Smith, the username would have been josmith1. Once we found this, it was easy enough to perform credential brute-forcing by using a list of common first and last names to match the discovered username format. By running the attack with just a few common passwords per username, we gained access to at least one account, which gave our red team an initial foothold. 
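A discovered format like that one can be expanded over a list of common names to build the brute-force wordlist. A minimal sketch, assuming the two-letters-of-first-name-plus-last-name-plus-digit pattern described above:

```python
def candidate_usernames(first, last, digits=range(1, 3)):
    """Expand a name into usernames matching the discovered pattern:
    first two letters of the first name + last name + a digit."""
    base = f"{first[:2].lower()}{last.lower()}"
    return [f"{base}{d}" for d in digits]

# In practice the names would come from a public census list or from
# employees scraped off LinkedIn; these two are placeholders.
common_names = [("John", "Smith"), ("Mary", "Jones")]
wordlist = [u for f, l in common_names for u in candidate_usernames(f, l)]
```

Feeding `wordlist` into a password-spraying tool with a few common passwords per username is all it took to get the initial foothold described above.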

Scraping document metadata is also useful for detecting internal hostnames and software versions in use at the targeted company. This enables an attacker to customize the attack to exploit vulnerabilities specific to that company, and it is an important part of victim reconnaissance.  

Adversaries can also use scraping to collect gated information from a website if that information isn't properly protected. Take Facebook's password-reset page: Anyone can find privately listed people through a simple query with a phone number. While a password-reset page may be necessary, does it really need to confirm or, worse, return a user's private information? 

While this may be a worst-case scenario, many websites are still vulnerable to user enumeration via simple error messages. I often see a registration, login, or password-reset page return a message like "the username could not be found" when someone submits invalid credentials or requests a reset for a nonexistent account. While this seems innocent enough, attackers can abuse this notification to determine which usernames or emails exist as registered accounts for the service. A list of valid usernames can be used for more targeted credential brute-force attacks, and valid emails can be used in targeted phishing attacks.
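To make the weakness concrete, here is a minimal sketch of how such a verbose message becomes an enumeration oracle. The endpoint URL, form field, and error string are hypothetical placeholders for what an attacker would observe on a real page:

```python
import urllib.parse
import urllib.request

# Hypothetical target details -- in practice these come from first
# inspecting the real page and its responses.
RESET_URL = "https://example.com/password-reset"
NOT_FOUND_MARKER = "the username could not be found"

def account_exists(response_body):
    """Classify one response: if the verbose 'not found' message is
    absent, the submitted username likely belongs to a real account."""
    return NOT_FOUND_MARKER not in response_body.lower()

def probe(username):
    """Submit a single reset request and classify the response."""
    data = urllib.parse.urlencode({"username": username}).encode()
    with urllib.request.urlopen(RESET_URL, data=data) as resp:
        return account_exists(resp.read().decode(errors="replace"))

# Usage (not run here): valid = [u for u in username_wordlist if probe(u)]
```

Because each probe is an ordinary form submission, nothing about it looks like an exploit in the logs; only the volume gives it away.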

Controlling the Threat
There are several ways to reduce the risk of Web scraping.

First, organizations should regularly audit their websites to make sure they are not unintentionally exposing sensitive information to public-facing websites through published documents or information stored in back-end databases that are linked through the website. 

Organizations should also have a process in place to strip metadata from documents before they are published externally. They should prevent exposing things such as usernames, file paths, print queues, and software versions, as these can all be useful in mounting an attack. 
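As one way to implement that stripping step, this sketch copies a .docx and blanks the author-related fields in `docProps/core.xml`. The exact list of fields to scrub, and the other document formats an organization publishes, will vary:

```python
import zipfile
import xml.etree.ElementTree as ET

CORE = "docProps/core.xml"
# Fields worth blanking; real documents may carry more (company,
# manager, revision history), and PDFs need separate handling.
SCRUB = {"creator", "lastModifiedBy"}

def strip_docx_metadata(src, dst):
    """Copy a .docx, blanking author-related fields in its core
    properties part; all other parts are copied through unchanged."""
    with zipfile.ZipFile(src) as zin, zipfile.ZipFile(dst, "w") as zout:
        for item in zin.infolist():
            data = zin.read(item.filename)
            if item.filename == CORE:
                root = ET.fromstring(data)
                for el in root.iter():
                    if el.tag.rsplit("}", 1)[-1] in SCRUB:
                        el.text = ""
                data = ET.tostring(root)
            zout.writestr(item, data)
```

Wiring a step like this into the publishing pipeline is more reliable than asking authors to remember to clear document properties by hand.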

Password-reset pages often contain verbose messages that reveal if a submitted username is valid or not. Going back to the Facebook example, should the password-reset page return the full name and profile picture associated with a phone number before sending a reset link? In these instances, the password-reset page reveals unnecessary information. Where possible, pages should return a generic message after a person submits information for a password reset, letting them know a text or email will be sent to the account if it exists. The key is that the page should not indicate whether the account or information is valid.  
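The safe behavior can be as simple as returning one fixed message regardless of the lookup result. A sketch, with `lookup_account` and `send_reset_link` standing in for the application's own data-access and mail functions:

```python
def handle_reset_request(identifier, lookup_account, send_reset_link):
    """Process a password-reset submission without revealing whether
    the account exists: the reset link is sent only for valid accounts,
    but the response is identical either way."""
    account = lookup_account(identifier)
    if account is not None:
        send_reset_link(account)  # side effect only when valid
    # No name, no photo, no validity hint -- one generic message.
    return "If an account matches, a reset link will be sent to it."
```

Note that response timing should also be kept uniform; if the valid-account path is measurably slower, the timing difference becomes the oracle instead of the message.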

Rate limiting and CAPTCHAs are standard defenses against scraping, but a determined attacker may still bypass them by using CAPTCHA-solving services or rotating through a list of IP addresses. These measures make Web scraping more difficult, but they are not a substitute for properly protecting sensitive data.
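A per-client token bucket is one common way to implement the rate-limiting piece. A minimal sketch, with illustrative thresholds rather than recommended ones:

```python
import time
from collections import defaultdict

class TokenBucket:
    """Per-client token bucket: roughly `rate` requests per second,
    with bursts up to `capacity`. Keyed by client IP, which is exactly
    why IP rotation weakens this defense."""

    def __init__(self, rate=1.0, capacity=5):
        self.rate, self.capacity = rate, capacity
        self.tokens = defaultdict(lambda: capacity)
        self.last = defaultdict(time.monotonic)

    def allow(self, client_ip):
        now = time.monotonic()
        elapsed = now - self.last[client_ip]
        self.last[client_ip] = now
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens[client_ip] = min(
            self.capacity, self.tokens[client_ip] + elapsed * self.rate)
        if self.tokens[client_ip] >= 1:
            self.tokens[client_ip] -= 1
            return True
        return False
```

A scraper rotating through a pool of addresses gets a fresh bucket per address, which is why rate limiting slows attacks down rather than stopping them.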

Recognize the Threat
While Web scraping has long been viewed more as an annoyance than a security risk, it is widely used by attackers to gain critical insights into a company, particularly for user enumeration attacks. Implementing some of these security measures can greatly reduce a company's risk.

Rob Simon is a Principal Security Consultant at TrustedSec, where he specializes in Web and mobile applications, as well as hardware security. Rob has more than a decade of experience in information security, with roles ranging from software development to penetration ...
 
