Data De-Identification: Balancing Privacy, Efficacy & Cybersecurity

Companies must do a delicate dance between protecting consumer privacy, upholding their product's efficacy, and reducing the risk of cyber breaches to run the business.

Ayan Halder, Principal Product Manager, Arkose Labs

November 27, 2023


COMMENTARY

Global data privacy laws were created to address growing consumer concerns about individual privacy. These laws include several best practices for businesses about storing and using consumers' personal data so that the exposure of personally identifiable information (PII) is limited in case of a data breach.

However, several recent data breaches prove that consumer data remains vulnerable. Why is it that such strict regulations have not been able to safeguard consumer data, beyond generating ad-hoc revenue by penalizing a few businesses that blatantly flout privacy concerns? The answer may lie in the delicate dance companies must do between protecting consumer privacy, upholding their product's efficacy, and reducing the risk of cyber breaches.

Data De-Identification Weaknesses in the Digital World

There are two primary laws guiding online privacy: the General Data Protection Regulation (GDPR) and the California Privacy Rights Act (CPRA), although many countries and states have started to write their own. Among the safeguards these laws describe, data de-identification is one of the most prominent.

Both laws define data de-identification as the process of anonymizing PII so that no piece of secondary information, when associated with the personal data, can identify the individual. The industry unanimously treats some attributes as personal data, including a name, address, email address, and phone number. Others, such as an IP address (and versions of it), are open to interpretation. These laws neither explicitly list the attributes that are personal nor specify how and when to anonymize them, beyond sharing a few best practices.

However, fully anonymized personal data, and the data linked to it, is useless to businesses in this increasingly digital world. Every new technological breakthrough demands massive input of data sets, both personal and aggregated. As an example, companies need to maintain non-anonymized data sets for their users to validate login attempts, prevent account takeovers, provide personalized recommendations, and more. A financial institution needs several key pieces of personal data to comply with know-your-customer (KYC) rules; similarly, an e-commerce provider needs its end user's delivery address.

Such use cases cannot be fulfilled with completely de-identified data sets. Hence, companies use a process known as pseudo-anonymization (also called pseudonymization), a one-way hashing technique that converts personal data into a string of seemingly random characters that can't be reverse engineered. But this technique has a serious flaw: Rehashing the same personal data yields the same string of characters.
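The flaw is easy to see in a few lines. The following is a minimal sketch, assuming salted SHA-256 as the hashing scheme; the `pseudonymize` helper, the salt value, and the phone numbers are hypothetical and only serve as illustration.

```python
import hashlib


def pseudonymize(value: str, salt: str) -> str:
    """Return the hex SHA-256 digest of the salt followed by the value."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


SALT = "example-salt"  # hypothetical salt, for illustration only

first = pseudonymize("+14155550123", SALT)
second = pseudonymize("+14155550123", SALT)  # same input, hashed again
other = pseudonymize("+14155550124", SALT)   # input differs by one digit

print(first == second)  # identical input and salt give an identical digest
print(first == other)   # a different input gives a different digest
```

The digest cannot be run backward, but anyone who holds the salt can recompute it for a guessed input and compare.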

In the event of a data breach, if the hacker gets access to a database of pseudo-anonymized personal data and the key (also referred to as the salt) used to pseudo-anonymize it, they could infer the actual consumer data by running lists of breached personal data available on the Dark Web through the same hash and matching the output by sheer brute force. What's worse: Individual device and browser metadata is almost always stored in raw format, making it easier for the hacker to run associations and get past fraud-detection systems.

If the hacker gets access to a financial institution's database containing pseudo-anonymized personal phone numbers along with a range of browser and device attributes that are tied to the end user, the hacker can run possible phone number combinations through the same algorithm and match the output with the database. Running all possible phone numbers in the United States through the SHA-256 cryptographic hash algorithm takes less than two hours on a modern MacBook. Running the match will take even less time.
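The matching step can be sketched as follows, from a defender's point of view, to show why a leaked salt undoes the protection. This is an illustration under stated assumptions, not a description of any real system: the salted SHA-256 scheme, the `recover_phones` helper, the salt, and the numbers are all hypothetical, and the search covers only 10,000 candidates in one made-up prefix rather than the full number space.

```python
import hashlib


def pseudonymize(phone: str, salt: str) -> str:
    """Hex SHA-256 digest of salt + phone (assumed scheme)."""
    return hashlib.sha256((salt + phone).encode("utf-8")).hexdigest()


def recover_phones(leaked_hashes, salt, prefix, start, stop):
    """Hash every candidate prefix + 4-digit suffix in [start, stop)
    and return {digest: phone} for candidates found in the leak."""
    targets = set(leaked_hashes)  # set membership keeps each lookup cheap
    recovered = {}
    for n in range(start, stop):
        candidate = f"{prefix}{n:04d}"
        digest = pseudonymize(candidate, salt)
        if digest in targets:
            recovered[digest] = candidate
    return recovered


LEAKED_SALT = "example-salt"  # hypothetical: the salt was stolen with the data
leaked = [pseudonymize("+1415555" + s, LEAKED_SALT) for s in ("0123", "0199")]

found = recover_phones(leaked, LEAKED_SALT, "+1415555", 0, 10000)
print(sorted(found.values()))  # ['+14155550123', '+14155550199']
```

Assuming the full ten-digit space of at most 10^10 numbers, finishing in two hours works out to roughly 1.4 million hashes per second, which is why the input space of a phone number is too small to resist this kind of sweep.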

Using the phone number along with the browser and device attributes, an attacker can attempt an account takeover. Even worse, they can send a phishing message, potentially hijacking cookies or tokens, and then replay those attributes to gain access to the end user's financial account.

Safeguarding Consumer Data in the Era of Pseudo-Anonymization

Safeguarding personal data requires constant monitoring and threat mitigation against sophisticated hackers. On the data infrastructure side, privacy vaults can disassociate sensitive data from the business's core infrastructure. In case of a breach, sensitive data stays in a secluded vault. Storing the key (salt) to the pseudo-anonymized data on separate infrastructure is also recommended to reduce breach impact.

Other recommendations include rotating the key at an optimum interval (typically, every three months). After a rotation, a given key can unlock only the personal data hashed during its own rotation period, reducing the volume of data at risk. Creating multiple keys is an additional defense technique. Beyond the one key used to unlock personal data, storing additional "dummy" keys confuses hackers on which key to use. Each additional dummy key adds to the time needed to unlock the data, thus buying additional time for the business to take mitigation steps.
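The effect of rotation can be sketched in a few lines. This assumes a keyed hash (HMAC-SHA-256) with one key per rotation period; the period labels, key values, record layout, and the `exposed_periods` helper are hypothetical and chosen only to illustrate the idea.

```python
import hashlib
import hmac


def pseudonymize(value: str, key: bytes) -> str:
    """Keyed hash (HMAC-SHA-256) of the value, as a hex string."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical keys, one per three-month rotation period.
KEYS = {"2023-Q3": b"key-for-q3", "2023-Q4": b"key-for-q4"}

# The same phone number stored in two periods, each under that period's key.
records = [
    {"period": period, "phone_hash": pseudonymize("+14155550123", key)}
    for period, key in KEYS.items()
]


def exposed_periods(records, stolen_key: bytes, candidate: str):
    """Return the periods whose records a single stolen key can match."""
    digest = pseudonymize(candidate, stolen_key)
    return [
        r["period"]
        for r in records
        if hmac.compare_digest(r["phone_hash"], digest)
    ]


print(exposed_periods(records, KEYS["2023-Q4"], "+14155550123"))  # ['2023-Q4']
```

A key stolen from one period matches only the records written under it, so the records from other periods stay unmatched unless their keys leak too.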

Anonymizing nonpersonal information, such as device and network data related to the consumer, also makes the hacker's job more complex: They now have more data to unlock, possibly with higher cardinality than the personal data itself.
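One way to apply this is to run device and network fields through the same kind of keyed hash before storage. The sketch below assumes HMAC-SHA-256; the `pseudonymize_record` helper, the field names, the key, and the sample event are hypothetical.

```python
import hashlib
import hmac


def pseudonymize_record(record: dict, key: bytes, fields: tuple) -> dict:
    """Return a copy of the record with the named fields replaced by
    HMAC-SHA-256 digests; all other fields are left unchanged."""
    protected = dict(record)
    for name in fields:
        if name in protected:
            raw = str(protected[name]).encode("utf-8")
            protected[name] = hmac.new(key, raw, hashlib.sha256).hexdigest()
    return protected


KEY = b"example-key"  # hypothetical; in practice kept apart from the data

event = {
    "ip": "203.0.113.7",
    "user_agent": "Mozilla/5.0 (X11; Linux x86_64)",
    "screen": "1920x1080",
    "event": "login",
}

protected = pseudonymize_record(event, KEY, ("ip", "user_agent", "screen"))
print(protected["event"])    # login
print(len(protected["ip"]))  # 64
```

Because equal inputs still produce equal digests under one key, the stored values remain usable for matching a returning device, while free-form fields such as a user-agent string are harder to enumerate than a ten-digit phone number.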

While businesses should take proactive monitoring and mitigation measures, not every proactive measure can thwart every attack. Hence, strong retroactive mitigation measures are equally recommended.

About the Author(s)

Ayan Halder

Principal Product Manager, Arkose Labs

As the Principal Product Manager at Arkose Labs, Ayan leads Detection product strategy and development to help businesses identify and secure against botnet attacks and human-driven fraud. Ayan has over 10 years of professional experience across several domains and over five years exclusively in the fraud detection space at TeleSign and Arkose Labs. He has a deep interest and domain expertise in the fraud detection space and actively writes about the opportunities and challenges of this growing space.
