Dark Reading is part of the Informa Tech Division of Informa PLC



Companies' 'Anonymized' Data May Violate GDPR, Privacy Regs

New study found that any database containing 15 pieces of demographic data could be used to identify individuals.

For more than two decades, researchers have chipped away at the assumption that anonymized collections of data could protect the identities of research subjects as long as the datasets did not include one of a score of different identifying attributes.  

In the latest research highlighting the ease of what is known as "re-identification," three academic researchers have shown that 99.98% of Americans could be re-identified from an otherwise anonymized dataset, if it included 15 demographic attributes.

The findings suggest that even current policies for protecting customer identities, such as the General Data Protection Regulation (GDPR), fall short of truly protecting citizens.

In the paper, which appeared in Nature Communications on July 23, the researchers conclude that "even heavily sampled anonymized datasets are unlikely to satisfy the modern standards for anonymization set forth by GDPR and seriously challenge the technical and legal adequacy of the de-identification release-and-forget model."

The paper adds to the mountain of research suggesting that any dataset containing useful information about individuals could likely be used to re-identify those subjects and link them to information that may be protected by privacy regulations or law. The research could prompt a rethinking of whether all big data sets need significantly better protection.

"Many companies think that, if it's anonymous, I don't need to secure it, but the data is likely not as anonymous as they think," says Bruce Schneier, a lecturer at Harvard University's Kennedy School of Government and the author of Data and Goliath, a book about how companies' data collection results in a mass-surveillance infrastructure. "Again and again and again, we have learned that anonymization of data is extremely hard. People are unique enough that data about them is enough to identify them."

The findings mean that companies and government agencies need to reassess how they deal with "anonymized" data, says Scott Giordano, vice president of data protection at Spirion, a provider of data-security services. The US Department of Health and Human Services, for example, currently requires that businesses either remove 18 different classes of information from files or have an expert review their anonymization techniques in order to certify data as non-identifying.

That may not be enough, he says.

"It is too easy, with advances in big data, to de-anonymize things that maybe you couldn't have de-anonymized five years ago," Giordano says. "We are in an arms race between the desire to anonymize data and our collection of big data, and big data is winning."

Zip Code, Gender, and DoB

Concerns over re-identification emerged in the late 1990s, when then-graduate student Latanya Sweeney researched the possibility of combining voter rolls with medical research records on Massachusetts state employees to de-anonymize patients' information. Famously, Sweeney, now a professor of government and technology in residence at Harvard University, was able to find then-Governor William Weld's medical record in the dataset. In a 2000 paper, she estimated that 87% of US citizens could be identified using just three pieces of information: their 5-digit zip code, gender, and date of birth.
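The linkage Sweeney described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not her actual method or data: every record, name, and field name below is made up, and the helper functions `key` and `link` are hypothetical. The sketch joins a name-stripped medical file to a public voter roll on zip code, gender, and date of birth, and treats a record as re-identified only when exactly one voter matches.

```python
from collections import defaultdict

# All records below are invented for illustration.
# "Anonymized" medical records: names removed, three attributes kept.
medical_records = [
    {"zip": "02138", "gender": "M", "dob": "1950-01-15", "diagnosis": "condition A"},
    {"zip": "02138", "gender": "F", "dob": "1962-03-14", "diagnosis": "condition B"},
    {"zip": "02139", "gender": "F", "dob": "1980-11-02", "diagnosis": "condition C"},
]

# Public voter roll: names plus the same three attributes.
voter_roll = [
    {"name": "Voter One", "zip": "02138", "gender": "M", "dob": "1950-01-15"},
    {"name": "Voter Two", "zip": "02138", "gender": "F", "dob": "1962-03-14"},
    {"name": "Voter Three", "zip": "02139", "gender": "F", "dob": "1980-11-02"},
    {"name": "Voter Four", "zip": "02139", "gender": "F", "dob": "1980-11-02"},
]

QUASI_IDENTIFIERS = ("zip", "gender", "dob")


def key(record):
    """Build the join key from the shared attributes."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def link(medical, voters):
    """Map voter name -> diagnosis wherever the join key matches exactly one voter."""
    names_by_key = defaultdict(list)
    for voter in voters:
        names_by_key[key(voter)].append(voter["name"])

    matches = {}
    for record in medical:
        names = names_by_key.get(key(record), [])
        if len(names) == 1:  # an unambiguous match is a re-identification
            matches[names[0]] = record["diagnosis"]
    return matches


reidentified = link(medical_records, voter_roll)
print(reidentified)
```

In this toy data, two of the three medical records link to a single named voter; the third stays ambiguous because two voters share the same zip code, gender, and date of birth.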

With data collection proliferating across personal devices, from smartphones to Apple Watches to connected mattresses, technology firms and data aggregators are making choices that affect the rights of US citizens, she argued in a 2018 speech at Stanford University's School of Engineering.

"We live in a technocracy — that is, we live in a world in which technology design dictates the rules we live by," she said. "We don't know these people, we didn't vote for them in office, there was no debate about their design, but yet, the rules that they determined by the design decisions they make — and many of them somewhat arbitrary — end up dictating how we will live our lives." 

The Nature Communications paper, written by a team of three researchers from Imperial College London and Belgium's UCLouvain, shows that the more attributes are collected about a person, the more likely that person is to be unique in a dataset. For companies, the lesson is that any sufficiently detailed dataset is unlikely to be truly anonymous. Even releasing partial datasets runs the risk of re-identification, the researchers found.
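The effect of adding attributes can be seen with a simple count. The sketch below is not the researchers' model, which estimates uniqueness statistically from incomplete data; it only tallies, on a made-up eight-row table with hypothetical column names, what share of rows is unique as more columns are considered.

```python
from collections import Counter

# Made-up records; the column names are hypothetical demographic attributes.
FIELDS = ("zip", "gender", "birth_year", "marital_status", "vehicles")
people = [
    ("02138", "F", 1980, "married", 1),
    ("02138", "F", 1980, "single", 1),
    ("02138", "F", 1975, "married", 2),
    ("02138", "M", 1980, "married", 1),
    ("02139", "F", 1980, "married", 1),
    ("02139", "F", 1980, "married", 2),
    ("02139", "M", 1975, "single", 0),
    ("02139", "M", 1975, "single", 1),
]


def unique_fraction(rows, n_fields):
    """Share of rows whose first n_fields values appear exactly once."""
    keys = [row[:n_fields] for row in rows]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(rows)


# Uniqueness when using the first 1, 2, ..., 5 attributes.
fractions = [unique_fraction(people, n) for n in range(1, len(FIELDS) + 1)]
for n, fraction in enumerate(fractions, start=1):
    print(f"{n} attribute(s): {fraction:.1%} of rows are unique")
```

On this toy table the unique share climbs from 0% with zip code alone to 100% with all five attributes, which mirrors the direction of the paper's finding even though the numbers themselves are purely illustrative.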

"Moving forward, (our results) question whether current de-identification practices satisfy the anonymization standards of modern data protection laws such as GDPR and CCPA (the California Consumer Privacy Act) and emphasize the need to move, from a legal and regulatory perspective, beyond the de-identification release-and-forget model," the researchers stated in the paper. 

This leaves companies with no easy answers on whether following current guidelines is enough to protect the anonymity of the information in their care, says Pravin Kothari, CEO of CipherCloud, a data-security provider. 

"This finding proves that re-identification is easy, so companies need to make sure they are anonymizing all demographic data, not just names," he says. "The removal of names is simply not enough to properly de-identify a person. We'll need to ensure that all personally identifiable information is anonymized in order to remove the risk of re-identification of individuals."


Veteran technology journalist of more than 20 years. Former research engineer. Written for more than two dozen publications, including CNET News.com, Dark Reading, MIT's Technology Review, Popular Science, and Wired News. Five awards for journalism, including Best Deadline ...

User Rank: Ninja
8/1/2019 | 7:40:42 AM
18 points that should be removed
Dataset Removal

Even if they removed these key elements, it is still easy to find a person because of facial recognition and gatekeepers who are on the take.

At this point, there is too much data out there on individuals, which makes it easy to find someone. For example, when you send an email to a site, that information can be queried on Google's search engine; if the user posts anything online or posts a picture, that is tagged and identified.

A user would have to stay offline from the time they were born in order not to be found online. Everything is indexed and tagged now (it comes from Google, Microsoft, Amazon, Apple, Cisco, the Feds, the states; everyone is doing it, using bots, data correlation, and keyword tagging).

Expert Risk Assessment

From the article posted by the writer, this image comes from the HHS HIPAA information, but what are the criteria for determining the risk (how do you know whether it is small or not)? After all is said and done, this is going to come down to encrypting data at rest and on the fly in order to ensure PII does not get out or, if it does, that it will take countless measures to break it.

User Rank: Apprentice
7/29/2019 | 4:43:37 PM
Re: Clarification on the GDPR
The researchers conclude that "even heavily sampled anonymized datasets are unlikely to satisfy the modern standards for anonymization set forth by GDPR..." The GDPR focuses not on datasets but on each piece of information. One can read Recital 26 of the GDPR to understand that the GDPR does not apply to anonymous information. Thus, when working with a dataset or several available databases, the anonymization technique has to be applied to each piece of information, making sure that none of the remaining data relates to an identified or identifiable natural person.
User Rank: Ninja
7/26/2019 | 8:17:34 PM
Aren't police departments and federal agencies capturing our faces (facial recognition)?
OK, maybe I missed something: aren't US police departments using tools to capture information from license plates, as well as from traffic cameras and the cameras on their uniforms? Agencies are performing some sort of facial recognition when they stop you or at airports, and your information is being captured and recorded in a database. Federal agencies and other organizations are capturing your information, and this information is being retrieved from a centralized database. The database is called XKeyScore, so at this point nothing is private. This database creates relational tables where they determine relationships based on success rates. There are other tools, but I won't go into detail (Facia, Informant, Stingray, etc.).
Even if companies are anonymizing data (jumbling or encrypting it), the user's information can be found on Facebook and LinkedIn (social media) and in public records (clerk of court, where you live and stay), and now they have added facial recognition to the equation. I don't think it is the companies we need to focus on, but the three-letter agencies that are using our PII as a way of determining whether we are legitimate or not.
Just something to think about.