Data science and analytics sit at an exciting intersection of math and science. I find it incredibly cool how the pattern recognition, heavy statistics, and calculus inside a machine learning algorithm get channeled through a tidy, iterative process into a final output. All of this comes with an important responsibility: to balance hard science and mathematics with an ethical lens. There is an assumption in the hard sciences that math is the pinnacle of unbiased research and leaves no room for ethics. This line of thought concludes that algorithms are the ultimate truth and that there shall be no arguing with the numbers.
But what happens when the numbers are biased? When they are only telling half-truths? How can this be diagnosed and mitigated? Data scientists and analysts are usually brave and keen enough to sift through these questions when they build models, and as I navigate these questions myself, I wanted to add a couple of ethics tools to the data science and analytics toolbox.
There are plenty of resources discussing why data science should use an ethical lens when approaching machine learning and artificial intelligence. The Data Science and Engineering team at DataDrive just finished reading Weapons of Math Destruction for our quarterly book club, and it laid out a lot of the issues with algorithms we encounter on a daily basis. In the book, O'Neil paints in broad strokes how to mitigate Weapons of Math Destruction but doesn't expand on how to establish a daily practice of increasing awareness of ethics in data.
Hopefully, after reading through the following checklist to promote ethical data science, you will be able to look beyond the logic of math as a universal truth and diagnose harmful ethical dilemmas in algorithms and models. More importantly, you’ll have some additional tools to fight back against Weapons of Math Destruction to foster a better world in technology.
The Ethical Data Science Toolkit
Here are four ways to decrease bias in your data algorithms (Thank you, Cathy O’Neil!)
- Understand Privacy
- It’s about more than protecting passwords and email (even though that should be an important part of thinking about privacy in your modeling). Laws like the GDPR in Europe or HIPAA in the United States are examples of establishing ethics in data privacy at the regional and national level, and they provide guidelines for using, storing, and sharing personal data.
- Ways to mitigate: Ask yourself if you are pulling in extraneous information for your analysis. Could some columns in your dataset that contain personal data be removed prior to crunching numbers? Is there data that you can aggregate or scrub prior to analysis? I also like to do a gut check with myself: if my [zip code, gender, address, medical records] were released when this model was published, would I be upset? What about security? Does this get handed off in a siloed fashion to the cybersecurity team? Consider breaking down the walls and having a conversation with the security team about how to look at security during the build instead of waiting until it’s finished to consider privacy.
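The scrub-before-analysis step above can be sketched in a few lines. This is a minimal illustration, not a complete de-identification pipeline; the column names (`PII_COLUMNS`, `zip_code`) are hypothetical examples of direct and quasi-identifiers you might find in your own data.

```python
# Sketch: drop direct identifiers and coarsen quasi-identifiers before analysis.
# PII_COLUMNS and the field names below are assumed examples, not a standard.

PII_COLUMNS = {"name", "email", "address", "medical_record_id"}

def scrub_record(record: dict) -> dict:
    """Remove direct identifiers and generalize zip code to its 3-digit prefix."""
    cleaned = {k: v for k, v in record.items() if k not in PII_COLUMNS}
    if "zip_code" in cleaned:
        # Keep coarse geography for analysis while dropping the exact zip.
        cleaned["zip_region"] = str(cleaned.pop("zip_code"))[:3]
    return cleaned

row = {"name": "Ada", "email": "ada@example.com",
       "zip_code": "55408", "hours_worked": 42}
print(scrub_record(row))  # {'zip_code' and identifiers gone, 'zip_region': '554'}
```

Even a simple pass like this forces the "do I actually need this column?" conversation before the data reaches a model.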
- Ensure You Have an Unbiased Training Data Set
- This sounds easy at first, but it can be really tricky. Consider the saying “You don’t know what you don’t know.” How do you know if you have a biased training set?
- Ways to mitigate: Is the training data set you are using representative of the whole? For example, let’s say you are running a model to predict the number of hours citizens in Minneapolis spend working during the week based on a number of metrics (gender, age, race, education). All of these metrics should be represented in the same ratio in your training set as they exist city-wide. It would also be beneficial to use training data that comes from all of the neighborhoods in the same ratio to make it representative of the whole. In Minneapolis, for example, the outcome of running an income-based model in the Kenwood neighborhood would be very different from Phillips. Representation matters.
- Here is an additional article from the Brookings Institution that walks through some great examples of bias in data and provides some mitigation strategies in practice.
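The representativeness check above can be automated as a first-pass sanity test. Here is a minimal sketch; the neighborhood labels, citywide shares, and the 5% tolerance are all invented for illustration, and a real check would use actual census or survey proportions.

```python
from collections import Counter

def representation_gaps(sample_labels, population_shares, tolerance=0.05):
    """Return groups whose share of the training sample differs from their
    population share by more than `tolerance` (absolute proportion)."""
    counts = Counter(sample_labels)
    n = len(sample_labels)
    gaps = {}
    for group, target in population_shares.items():
        observed = counts.get(group, 0) / n
        if abs(observed - target) > tolerance:
            gaps[group] = round(observed - target, 3)
    return gaps

# Hypothetical neighborhood labels for training rows vs. assumed citywide shares.
sample = ["Kenwood"] * 70 + ["Phillips"] * 30
citywide = {"Kenwood": 0.02, "Phillips": 0.05}
print(representation_gaps(sample, citywide))  # both neighborhoods are overrepresented
```

An empty result doesn't prove the sample is unbiased, but a non-empty one is a concrete prompt to resample or re-weight before training.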
- Develop Transparency in Your Methods
- We all feel more comfortable using an algorithm when we understand the decision-making of a model. While there will always be some sort of black box in AI, creating transparency around the data used and how it is manipulated, along with some understanding of the outcome, is imperative to trusting the solution.
- Ways to mitigate: I love the way this Medium article breaks down the transparency of data and the importance of communication. By stating intentions, processes, datasets, outcomes, and failures (and how they were fixed), you can achieve the transparency that garners trust in a model. It’s about more than just spitting out recommendations. Ethical data science is about building trust in a model through clear communication about the process.
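One lightweight way to practice the documentation habit above is to ship a small "model card" alongside every model. The sketch below captures the elements the article lists (intentions, process, datasets, outcomes, failures); every field name and value here is a made-up example, not a standard schema.

```python
# Sketch: a minimal model card. All fields and values are illustrative.
model_card = {
    "intent": "Estimate weekly hours worked by Minneapolis residents",
    "process": "Gradient-boosted trees, 5-fold cross-validation",  # assumed method
    "datasets": "2020 city survey (training), 2021 city survey (holdout)",
    "outcomes": "Mean absolute error of 3.2 hours on holdout",  # hypothetical metric
    "known_failures": "Underpredicted part-time workers; fixed by re-weighting",
}

def render_card(card: dict) -> str:
    """Render the card as plain text that can ship with the model artifact."""
    return "\n".join(f"{field.upper()}: {value}" for field, value in card.items())

print(render_card(model_card))
```

The point isn't the format; it's that stating intentions and failures in writing makes the model's decision-making legible to people outside the build team.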
- Zero in on Your Blind Spots
- Going back to the second point, it’s hard to know what you don’t know, but there are strategies you can use to start finding your blind spots.
- Ways to mitigate: Disaggregating your results can help you zoom in on your blind spots. This has been well documented in the facial recognition arena: aggregated results may show an algorithm correctly identifying most people, but when the results are disaggregated, many models turn out to misidentify women and people of color at much higher rates. Try this with your outcomes across a number of metrics to see if you can identify any blind spots. Just like defending a thesis, poke holes in your results and in your model.
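The disaggregation check above is mechanically simple. Here is a minimal sketch with invented data, deliberately constructed so the aggregate accuracy looks healthy while one subgroup fares much worse; the group names are placeholders.

```python
from collections import defaultdict

def accuracy_by_group(records):
    """records: iterable of (group, predicted, actual) tuples.
    Returns per-group accuracy instead of a single aggregate number."""
    hits, totals = defaultdict(int), defaultdict(int)
    for group, predicted, actual in records:
        totals[group] += 1
        hits[group] += predicted == actual
    return {g: hits[g] / totals[g] for g in totals}

# Invented results: group_a is 90% accurate, group_b only 60%.
results = (
    [("group_a", 1, 1)] * 90 + [("group_a", 1, 0)] * 10
    + [("group_b", 1, 1)] * 6 + [("group_b", 1, 0)] * 4
)
overall = sum(p == a for _, p, a in results) / len(results)
print(f"overall: {overall:.2f}")   # the aggregate number looks fine
print(accuracy_by_group(results))  # the breakdown reveals the gap for group_b
```

Swap in whichever dimensions matter for your domain (neighborhood, age band, device type) and poke holes in your own results before someone else has to.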
These are just a handful of ways we can increase our impact on the data we are putting out into the world to make it more robust, trustworthy, and meaningful. After all, an algorithm is only as good as the people writing its code. These simple steps will help create a community around an ethical data science practice while promoting a better system for analysis.