With thousands of machine learning papers coming out every year, there's a lot of new material for engineers to digest in order to keep their knowledge up to date. New data protection regulations popping up every year and increasing scrutiny on personal data security add another layer of complexity to the very core of effective machine learning: good data. Here's a quick cheat sheet for data protection best practices.
1. Make Sure You're Allowed to Use the Data
The General Data Protection Regulation (GDPR), which protects people located in the EU regardless of their citizenship and can apply to organizations based anywhere in the world, requires privacy by design (including privacy by default and respect for user privacy as foundational principles). This means that if you're collecting any data with personally identifiable information (PII), you must minimize the amount of personal data collected, specify the exact purposes for which the data will be used, and limit its retention time.
When consent is the legal basis for collecting and using personal data, GDPR also requires that consent to be affirmative (implicit consent does not suffice). What this means is that a user has to explicitly give you the right to use their data for specific purposes. Even open source datasets can sometimes contain personal data such as Social Security numbers, so it's incredibly important to make sure that the data you're using is properly scrubbed.
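As a rough illustration of purpose limitation and retention limits, the sketch below filters a dataset down to records whose consent covers a given purpose and whose retention window has not expired. The `Record` fields, the purpose strings, and the `usable_for` helper are all hypothetical names chosen for this example; they are not taken from the GDPR text or from any particular library.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Record:
    """Hypothetical stored record carrying consent and retention metadata."""
    user_id: str
    payload: str
    collected_at: datetime
    consented_purposes: frozenset[str]  # purposes the user explicitly agreed to
    retention: timedelta                # how long the record may be kept


def usable_for(records, purpose, now=None):
    """Keep records whose consent covers `purpose` and that are within retention."""
    now = now or datetime.now(timezone.utc)
    return [
        r for r in records
        if purpose in r.consented_purposes
        and now - r.collected_at <= r.retention
    ]
```

A training pipeline could call `usable_for(records, "model_training")` before building a dataset, so records collected for another purpose, or kept past their retention window, never reach the model.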
2. Data Minimization Is a Godsend
Data minimization refers to the practice of limiting the amount of data that you collect to only what you need for your business purpose. It is useful for data protection regulation compliance and as a general cybersecurity best practice (so that a data leak, if one occurs, causes much less harm). An example of data minimization is blurring faces and license plate numbers in the data collected for training self-driving cars.
Another example is removing all direct identifiers (e.g., full name, exact address, Social Security number) and quasi-identifiers (e.g., age, religion, approximate location) from customer service call transcripts, emails, or chats, so it's easier to comply with data protection regulations while protecting user privacy. This has the additional benefit of reducing an organization's risk in case of a cyberattack.
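To make the transcript example concrete, here is a minimal sketch that replaces a few direct identifiers with placeholder tags. The regular expressions are simplified assumptions for illustration (US-style Social Security and phone number formats, a loose email pattern). They will miss many real-world variants, and names or addresses generally need a trained entity recognizer rather than regexes.

```python
import re

# Illustrative patterns only; production PII detection needs far broader coverage.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def redact(text):
    """Replace every match of each pattern with a bracketed label such as [SSN]."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


print(redact("My SSN is 123-45-6789, email jane.doe@example.com, call 555-123-4567."))
# prints: My SSN is [SSN], email [EMAIL], call [PHONE].
```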
3. Beware of Using Personal Data When Training ML Models
Simplistically, machine learning models memorize patterns within training data. So if your model is trained on data that includes PII, there's a risk that the model could leak user data to external parties while in production. Both in research and in industry, it has been shown that personal data present in training sets can be extracted from machine learning models.
One example of this is a Korean chatbot that was spewing out its users' personal information in production because of the real-world personal data the chatbot had been trained on: "It also soon became clear that the huge training dataset included personal and sensitive information. This revelation emerged when the chatbot began exposing people's names, nicknames, and home addresses in its responses."
Data minimization helps dramatically mitigate this risk. It also matters for the right to be forgotten under the GDPR. It is still unclear what this right means for ML models trained on the data of a user who has subsequently exercised it, with one possibility being having to retrain the model from scratch without that specific individual's data. Can you imagine the nightmare?
4. Keep Track of All Personal Data Collected
Data protection regulations, including the GDPR, often require organizations to know the locations and the usage of all PII collected. If redacting the personal data isn't an option, then accurate categorization is essential in order to comply with users' right to be forgotten and with access to information requests. Knowing exactly what personal information you have in your dataset also allows you to understand the security measures needed to protect the information.
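One way to picture this kind of tracking is a small in-memory inventory that maps each user to the places their PII is stored and the category of data held there. This is only a sketch: the `PIIInventory` class, its method names, and the location strings in the usage note are hypothetical, and a real system would persist the inventory and actually delete the underlying data.

```python
from collections import defaultdict


class PIIInventory:
    """Hypothetical registry of where each user's personal data lives."""

    def __init__(self):
        # user_id -> set of (location, category) pairs
        self._entries = defaultdict(set)

    def record(self, user_id, location, category):
        """Note that `location` holds PII of type `category` for this user."""
        self._entries[user_id].add((location, category))

    def access_report(self, user_id):
        """Support an access request: list everything held about the user."""
        return sorted(self._entries.get(user_id, set()))

    def erase(self, user_id):
        """Support an erasure request: drop the user and return locations to purge."""
        entries = self._entries.pop(user_id, set())
        return sorted({location for location, _ in entries})
```

For example, after `inv.record("u1", "crm.customers", "email")` and `inv.record("u1", "training/calls_v1", "phone")`, calling `inv.erase("u1")` returns both locations so each one can be purged.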
5. The Appropriate Privacy-Enhancing Technology Depends on Your Use Case
There's a general misunderstanding that one privacy-enhancing technology will solve all problems, be it homomorphic encryption, federated learning, differential privacy, de-identification, secure multiparty computation, etc. For example, if you need data to debug a system or you need to look at the training data for whatever reason, federated learning and homomorphic encryption are not going to help you. Rather, you need to look at data minimization and data augmentation solutions like de-identification, or pseudonymization that replaces personal data with synthetically generated PII.
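As a sketch of pseudonymization with synthetic replacements, the snippet below swaps known names for surrogate names chosen deterministically with a keyed hash, so the same person maps to the same surrogate across documents. Everything here is an assumption made for illustration: the `FAKE_NAMES` pool, the helper names, and the premise that the names to replace are already known. With such a small pool, two real names can collide on the same surrogate, so a real system would need a much larger generator and a separate detection step.

```python
import hashlib
import hmac
import re

# Hypothetical pool of synthetic names; a real pool would be far larger.
FAKE_NAMES = ["Robin Vale", "Taylor Brooks", "Jamie Stone", "Morgan Reed"]


def surrogate_name(real_name, key):
    """Pick a surrogate deterministically from a keyed hash of the real name."""
    digest = hmac.new(key, real_name.lower().encode("utf-8"), hashlib.sha256).digest()
    return FAKE_NAMES[int.from_bytes(digest[:4], "big") % len(FAKE_NAMES)]


def pseudonymize(text, known_names, key):
    """Replace each known name (case-insensitively) with its surrogate."""
    for name in known_names:
        replacement = surrogate_name(name, key)
        # A function replacement avoids backslash interpretation in the surrogate.
        text = re.sub(re.escape(name), lambda _m, r=replacement: r, text, flags=re.IGNORECASE)
    return text
```

Because the mapping depends on a secret key, the text stays readable for debugging while the link back to the real person is only recoverable by someone who holds the key and the original names.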