Purchasing databases from data brokers can create a problem for enterprise security executives. While there are tools to scan the files for malware, there is no automated way to make sure that the data contained in the database is accurate and, even more importantly, was obtained with proper consent. Without that assurance, those files can pose a threat to the enterprise's security compliance and may even open up the company to litigation.
Consider this scenario: Business unit leaders perform an exhaustive due diligence effort before purchasing databases from a data broker. The data is then widely distributed within the organization's global systems. Six months later, law enforcement authorities move against the data broker and report that all of its data was improperly obtained. The organization now has a compliance nightmare on its hands.
The organization might want to delete all of that data to comply with regulations. However, if the team did not tag the data when it was initially loaded into the system, it will be difficult to track and remove it. Even if the data was tracked successfully, it could have become so interwoven with petabytes of other data that it is no longer viable to extract.
On top of this, some regulators may apply the legal concept of "the fruit of the poisonous tree." That doctrine is typically invoked when law enforcement is accused of not properly obtaining a search warrant. If a judge finds that officers did act improperly, the doctrine excludes not only any evidence found during the search, but also anything discovered as a result of that evidence.
In the case of data, a strict regulator might insist that not only must a company delete the data broker's information, but also any information that resulted from processing that data. In other words, the analytics done on that data might have to be deleted as well.
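One way to picture that "fruit of the poisonous tree" reach is as a walk over a lineage graph: start at the broker's dataset and collect everything derived from it, directly or indirectly. The following is a minimal sketch under the assumption that the organization already records which datasets feed which others. The dataset names and the `tainted_descendants` helper are hypothetical and exist only for illustration.

```python
from collections import deque


def tainted_descendants(lineage, sources):
    """Return the source datasets plus every dataset derived from them.

    `lineage` maps a dataset name to the datasets built directly from it.
    """
    seen = set(sources)
    queue = deque(sources)
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Hypothetical lineage records; the names are illustrative only.
lineage = {
    "broker_feed": ["customer_master", "marketing_segments"],
    "crm_export": ["customer_master"],
    "customer_master": ["churn_model_scores"],
    "marketing_segments": [],
}

affected = tainted_descendants(lineage, ["broker_feed"])
print(sorted(affected))
```

In this sketch, `customer_master` is flagged even though it also draws on first-party CRM data, which mirrors the concern that purchased data becomes interwoven with everything else. The first-party `crm_export` itself is not flagged.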
Tracking Data as It Flows
Another major complicating factor with data compliance is that the folders of information that come from data brokers often reflect work done over many years. That means much of it stems from a time, a place, and a vertical where the rules were different.
"Due to the increasing regulatory compliance framework regarding data collection notice and consent, there are data brokers that have huge subsets of their data that is not 'clean' and they cannot make reps and warranties about it to third parties that want to leverage that data," says Sean Buckley, an attorney with law firm Dykema who specializes in data privacy issues. "The risk to the data broker circles back to whether their data is 'clean' and whether they can prove it if necessary."
ClearData CISO Chris Bowen argues that data tracking is critical when dealing with purchased files, but it can also prove quite difficult, even impossible, if the organization didn't tag the data sufficiently from the beginning.
"You need to closely track where the data lives and where it flows," Bowen says. "You need to tag the source of each field in the database. You need consistent links through petabytes of data, structured and unstructured."
Most security executives are not comfortable with this approach because dataflow analysis is outside of their usual remit, he adds.
"Where [data] flows and how it's distributed and how it is archived and destroyed, that's usually more the purview of the privacy office," Bowen says. "You need to protect and track the data through every element of its life cycle."
Critically, Bowen stresses that once new datasets are built on top of the data broker information, "it's darn near impossible to uncouple that data. It would take an act of AI to decouple and unwind all of that."
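Bowen's advice to tag the source of each field can be illustrated with a small sketch. It assumes that every value carries a source label from the moment it is loaded, so that fields from one supplier can later be stripped out of a merged record. The `TaggedValue` class, the helper functions, and the source names `crm` and `broker_x` are all hypothetical. A real system would also have to handle derived fields and unstructured data, which this example does not attempt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedValue:
    value: object
    source: str  # label recorded at ingestion time


def ingest(record, source):
    """Wrap every field of an incoming record with its source label."""
    return {field: TaggedValue(value, source) for field, value in record.items()}


def merge(base, extra):
    """Combine two tagged records; fields in `extra` win on conflict."""
    merged = dict(base)
    merged.update(extra)
    return merged


def purge_source(record, source):
    """Drop every field whose value came from the given source."""
    return {f: tv for f, tv in record.items() if tv.source != source}


# Illustrative data only.
first_party = ingest({"email": "a@example.com", "plan": "pro"}, "crm")
purchased = ingest({"income_band": "B", "household_size": 3}, "broker_x")

profile = merge(first_party, purchased)
cleaned = purge_source(profile, "broker_x")
print(sorted(cleaned))
```

Because the label travels with each value rather than with the file it arrived in, the purge still works after the purchased fields have been merged into a first-party profile.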
Putting AI to Work
That AI point is exactly where some other data specialists see this argument headed. They anticipate large language models (LLMs), such as ChatGPT, will be able to track the data through unlimited analytics efforts. In two to five years, the LLM approach may be effective enough for regulators to rely on it.
"Companies today use [the difficulty of data tracking] as an excuse to not produce the evidence. With the advent of machine learning models, that is no longer the case," says Brad Smith, a managing director at consulting firm Edgile.
Detailed tracking of the data throughout its life cycle is key to solving the data broker problem, he says.
"When you pull data in from an external organization, there is always going to be some level of liability. The solution is to maintain data lineage. Generally, when you move information, transfer or copy, or the data somehow morphs from one system to another, that lineage is broken," Smith says. "With the large language model, every piece of data exists in its original state. Those mappings exist in the neural network they have created."
The cloud also plays a critical role here, he adds.
"The only thing that they have to do is move their data into a hyperscale infrastructure," Smith says. "When regulators become aware of this and the [enterprise] hasn't sufficiently invested in Azure or AWS, they'll ask, 'Why haven't you moved to that platform?'"
Avoiding Tainted Data
Fundamentally, some believe that businesses are too quick to purchase third-party data from data brokers and that they should first seriously examine the data they already have or can collect directly.
"There is an open acknowledgement that the quality of third-party data is not good and that it's collected in a pretty dubious manner. Their definition of consent is spotty. Overall, the way data brokers get their data flies in the face of global privacy laws," says Stephanie Liu, a privacy analyst with Forrester.
"It's shocking how quickly we've normalized the aggregation of data that, just a few years ago, would have been considered an egregious intrusion of privacy," adds Rex Booth, the CISO for SailPoint. "Now the only delineation of right and wrong regarding brokers is whether they broke laws in gathering their data."
"When figuring out the data broker challenge, CISOs must factor in how the data is being used now and how it will likely be used in a year," he says. "Is it being used to make decisions about who gets a loan or an apartment? Is the resultant data visible to customers, or is it entirely internal, such as data to help sales know who to contact?"
Saugat Sindhu, a senior partner who heads the strategy and risk practice at consulting firm Wipro, says almost all data brokers provide deliverables in an anonymized fashion, but it often doesn't stay that way. "You can easily deanonymize an identity," he says.
In some cases, Sindhu says, the compliance remedy may go beyond data deletion to assessing revenue generated by the improperly created data: "You didn't do anything wrong knowingly, but you still made profits off of it and that may raise a fair trade issue," he says. "At the end of the day, tainted data is tainted data."