AI's Data Dilemma

Artificial intelligence (AI) has become a ubiquitous solution, from AI-enabled messaging platforms that suggest pre-written responses that you send with the click of a button, to conversational agents like ChatGPT, which stand to change the way we seek and present information and transform entire industries.

Despite all its promises, one area in which AI falls short is its ability to support the diversity of societal identities. Most notable are inherent racial and gender biases, where the outcomes of integrating these technologies disproportionately disadvantage members of society with certain racial backgrounds and gender identities.

Brittany Johnson-Matthews, an assistant professor in the Department of Computer Science at George Mason University College of Engineering and Computing, recently received a $750,000 grant from the National Science Foundation (NSF) for “Engaging Marginalized Groups to Improve Technological Equity,” which looks at the root causes of why such biases are embedded in technical products and will offer a toolkit to aid those designing these systems. This work is being done in collaboration with Angela D. R. Smith, an assistant professor in the School of Information at the University of Texas at Austin, along with external collaborators Meme Styles, Paulette Blanc, and Precious Azurée of the MEASURE organization and Denae Ford Robinson, a senior researcher at Microsoft Research.

This is an AI-generated photo. A diverse group of people look into the sky at a giant microscope, which is showing a generic looking entity.
This AI-generated picture is what Bing's Image Creator produced when prompted with "AI dataset bias."

When asked what motivated the grant proposal, Johnson-Matthews said, “We’ve seen rapid innovation in technology. We’re seeing more from AI, not just in terms of what we’re doing in research, but even more broadly in society. We’re seeing commercials about it, platforms that are adopting it for existing tasks and features, and so on. But we should want to know two very important things: are people using these technologies and are they helpful? Or more specifically, what kind of biases and problems are being introduced that prevent the ability to support all users?”

As a relatively innocuous example, Johnson-Matthews, a Black woman with locs, notes that if she searches for “wedding hairstyles” she sees a lot of results that are not relevant to her. To get something useful, she would need to search for something like “Black women wedding hairstyles.” On a more serious note, she said that studies have shown that searches for minority names are more likely to bring up negative results (e.g., arrest records) first, which in many cases is not what the searcher is looking for. Another recent example revealed that hospitals using an AI algorithm to determine patient care ultimately gave Black patients lower risk scores than white patients and did not give those patients the same level of care as white patients with the same needs.

Johnson-Matthews says that a large part of the problem is that the datasets used to inform these technologies are themselves biased; they don’t include enough representation from minority and underserved populations. “The experiences of the population building the technology is going to reflect how the technology operates. It’s a product of the position of the people creating it,” she said.

Thus, products like end-user AI systems often don’t reflect the broader society’s needs or perspectives but rather those of the subset typically available in datasets, which frequently represents the majority. Johnson-Matthews and her collaborators want to examine the barriers to engaging and amplifying marginalized voices in technological spaces and learn how to engage marginalized groups when designing and developing data-centric systems without sacrificing their safety, security, and trust.

Johnson-Matthews said getting participation is essential but challenging. “There is a piece missing. We’re aware our datasets are not representative, and that non-representative data is impacting the performance of our models – but no one is looking at the connecting tissue to better understand why we see a lack of representation in our datasets and what we can do about it.”

The ultimate goal for the proposed research is to develop an open-source toolkit that supports technologists in responsibly curating, sharing, and making use of representative datasets in the development of AI technologies. More details are at

Johnson-Matthews said, “The issues we see with tech are deeper than the tech itself, so the only way we can provide adequate solutions is by better understanding the root cause of the problems and who they affect most. Our research is an important step towards building a healthy, productive relationship between those who innovate and those who engage with those innovations in the era of artificial intelligence.”