Data Bias in Machine Learning: Implications for Social Justice

Take historically biased data, then add AI and ML to compound and exacerbate the problem.

Machine learning and artificial intelligence have taken organizations to new heights of innovation, growth, and profits thanks to their ability to analyze data efficiently and with extreme accuracy. However, some algorithms, black-box models in particular, have at times proven to be unfair and to lack transparency, compounding bias and harming minorities.

Black-box models present several key issues, and they all work together to bias data further. The most prominent is that the models are fed data that is historically biased to begin with, and fed by humans, who are biased by nature. In addition, machine learning is constantly aggregating this data, including personal data, with no transparency about how the data is being used or why. Data analysts can see only the inputs and outputs, not the internal workings that determine the results. Meanwhile, algorithms are making analyses and predictions about our work performance, economic situation, health, preferences, and more without providing insight into how they reached their conclusions.

In the infosec realm, this matters because security platforms and services increasingly rely on ML and AI for automation and superior performance. But if the underlying software and algorithms for these products and services reflect biases, they'll simply perpetuate the prejudices and erroneous conclusions associated with race, gender, religion, physical abilities, appearance, and other characteristics. This has implications for both information and physical security, as well as for personal privacy.

One of the most prominent examples of bias arising from these issues is the use of risk scores in the justice system. In law enforcement, risk scores are used to predict the likelihood that a crime will be committed by a person, by a group of people, or in a certain location. When police departments ask "What locations have higher crime rates?" in order to concentrate officers in crime-prone areas, they look at each location's risk score. But dispatching more police officers to a location leads to more arrests there, and more reported arrests of any kind in that area raise its risk score, which sends still more officers to the location. It's a never-ending cycle.
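The cycle described above can be illustrated with a toy simulation. This is a minimal sketch, not a model of any real policing system: the function name, the two areas, the crime rate, and the `concentration` parameter are all assumptions made for illustration. Both areas are given the same underlying crime rate; the only difference is that one starts with more patrols. Recorded arrests are assumed to scale with the number of officers present, and `concentration` is a hypothetical knob for how strongly patrols are shifted toward the higher-scoring area.

```python
def simulate_patrol_feedback(true_crime_rate, initial_patrols,
                             rounds=5, concentration=2.0):
    """Toy feedback loop: patrols follow a risk score built from past arrests.

    All parameters are illustrative assumptions, not real-world values.
    """
    total_patrols = sum(initial_patrols)
    patrols = [float(p) for p in initial_patrols]
    history = [patrols[:]]
    for _ in range(rounds):
        # Recorded arrests depend on how many officers are there to make them.
        arrests = [p * r for p, r in zip(patrols, true_crime_rate)]
        # Risk score: recorded arrests, sharpened by the concentration exponent.
        scores = [a ** concentration for a in arrests]
        total_score = sum(scores)
        # The next round's patrols are split according to the risk score.
        patrols = [total_patrols * s / total_score for s in scores]
        history.append(patrols[:])
    return history


# Two areas with the SAME true crime rate; area 0 simply starts with more patrols.
history = simulate_patrol_feedback([0.1, 0.1], [60, 40])
for round_number, (area_0, area_1) in enumerate(history):
    print(f"round {round_number}: area 0 = {area_0:.1f}, area 1 = {area_1:.1f}")
```

In this sketch, setting `concentration` to 1.0 leaves the initial 60/40 split frozen in place: the arrest data keeps "confirming" the original allocation even though the areas are identical. Any value above 1.0 makes the gap grow each round.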

A study of risk scores conducted by ProPublica found that Black defendants were 77% more likely to be pegged as at "higher risk of committing a future violent crime" and 45% "more likely to be predicted to commit a future crime of any kind." The study also found that the risk score formula was "particularly likely to falsely flag Black defendants as future criminals, wrongly labeling them this way at almost twice the rate as white defendants."
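"Falsely flagged" here corresponds to what statisticians call a false positive: someone labeled high risk who did not go on to reoffend. A minimal sketch of how that rate can be compared across groups follows. The function name and record layout are assumptions, and the counts are synthetic numbers invented for illustration; they are not ProPublica's data.

```python
from collections import defaultdict


def false_positive_rate_by_group(records):
    """Share of people who did NOT reoffend but were still flagged high risk.

    records: iterable of (group, flagged_high_risk, reoffended) tuples.
    """
    did_not_reoffend = defaultdict(int)
    wrongly_flagged = defaultdict(int)
    for group, flagged_high_risk, reoffended in records:
        if not reoffended:
            did_not_reoffend[group] += 1
            if flagged_high_risk:
                wrongly_flagged[group] += 1
    return {g: wrongly_flagged[g] / n for g, n in did_not_reoffend.items()}


# Synthetic records for two hypothetical groups, "A" and "B".
records = (
    [("A", True, False)] * 40 + [("A", False, False)] * 60
    + [("A", True, True)] * 50
    + [("B", True, False)] * 20 + [("B", False, False)] * 80
    + [("B", True, True)] * 50
)

rates = false_positive_rate_by_group(records)
print(rates)
print("ratio A/B:", rates["A"] / rates["B"])
```

A check like this requires knowing the actual outcomes and group membership for each record, which is exactly the kind of visibility that is often missing when only a score is exposed.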

Recently, Boston Celtics players published an opinion piece in The Boston Globe calling out the bias implications of facial recognition technology for minority communities. Facial recognition technology, which also uses black-box models, has a history of misidentifying Black people and other people of color. A test run by the ACLU, comparing headshots of members of Congress against mugshots, showed that 40% of those who were misidentified were people of color. Just last year, Robert Julian-Borchak Williams was wrongfully arrested by the Detroit Police Department after facial recognition technology misidentified him as a shoplifting suspect.

In healthcare, black-box models are typically used to help professionals make better recommendations on care and treatments based on patients' demographics, such as age, gender, and income. This is great, until we realize that the data is likely to favor a single treatment, and one generic treatment will not work for everyone. For example, if my colleague and I had the same diagnosis and were recommended the same treatment, the treatment could work for one of us and not the other because of our genetic makeup, which the algorithm does not account for.

In the end, data in itself is neither good nor bad. But without transparency into how black-box models produce their results, skewed output becomes difficult to reevaluate or fix, because there is no insight into the actual algorithm being used. As data professionals, we are responsible for ensuring that the information we gather and the results we project are fair to the best of our knowledge and, most importantly, do no harm, especially to vulnerable and underprivileged communities. It's time we go back to the basics: relying on interpretable models such as regressions and decision trees, and understanding the "why" of certain data points before analyzing or extracting the data, even if that means, at times, sacrificing accuracy for fairness.
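To make "interpretable" concrete, here is a minimal sketch of the simplest possible decision tree: a single split, sometimes called a decision stump. Everything in it is hypothetical, including the function names, the security-flavored feature names (`failed_logins`, `account_age_days`), and the six made-up training rows. The point is only that the fitted model can be printed as one rule that a person can read, question, and correct.

```python
def fit_stump(rows, labels, feature_names):
    """Find the single 'feature <= threshold' rule with the fewest training errors."""
    best = None
    for j, name in enumerate(feature_names):
        for threshold in sorted({row[j] for row in rows}):
            for left_label in (0, 1):
                right_label = 1 - left_label
                # Count the rows this candidate rule would get wrong.
                errors = sum(
                    (left_label if row[j] <= threshold else right_label) != y
                    for row, y in zip(rows, labels)
                )
                if best is None or errors < best["errors"]:
                    best = {"feature": name, "index": j, "threshold": threshold,
                            "left_label": left_label, "right_label": right_label,
                            "errors": errors}
    return best


def predict(stump, row):
    """Apply the learned rule to one row."""
    if row[stump["index"]] <= stump["threshold"]:
        return stump["left_label"]
    return stump["right_label"]


def describe(stump):
    """Render the whole model as one human-readable sentence."""
    return (f"if {stump['feature']} <= {stump['threshold']}: "
            f"predict {stump['left_label']}, else predict {stump['right_label']}")


# Made-up training data: (failed_logins, account_age_days); label 1 = flagged.
feature_names = ["failed_logins", "account_age_days"]
rows = [(0, 400), (1, 30), (0, 900), (6, 45), (8, 700), (9, 10)]
labels = [0, 0, 0, 1, 1, 1]

stump = fit_stump(rows, labels, feature_names)
print(describe(stump))
```

A model this small will usually be less accurate than a black-box alternative on real data, which is the trade-off described above. In exchange, the rule it applies is fully visible, so a reviewer can see which feature drives a decision and ask whether that feature is standing in for a protected characteristic.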