How Ingrained Biases Show Up in Data Management
One of the little ways I try to apply an equity lens to my work is by being very aware of my own biases, especially around race. This comes up as a practical matter more than you might expect in data management, especially if you’re working with data sets focused on individual people.
I’m sure there are more egregious examples, but one thing I’ve seen and found particularly jarring is the abbreviation “BG” in front of an ethnicity. What does BG stand for? Best Guess. In other words, whoever collected the data or set up the database structure to include these value options decided that having some ethnicity value on file for a person mattered more than that person’s own racial/ethnic identity. This insidious little practice takes away individual autonomy and conflates ethnicity with skin color, erasing actual identity in favor of having more complete statistics.
This isn’t an uncommon tension, either. The tug of war between big data and personal privacy has often been discussed in the mainstream media, but I haven’t seen as much discussion of the decisions companies make on whether to trust a data point. Every organization has to set a bar for what counts as known. Do we assume that you’re gay because you shopped at a certain website or belong to an LGBT organization? Do we assume you’re a man because you purchased a tie? In the nonprofit world, the data tends to be a little more solid, but I’ve still had people ask questions like “I know that X is queer, can I mark them as such in the data?” or “I’m pretty sure this person is trans. Can I just ask them so we can update the data?” If we’re going to get reliable data about personal information and respect privacy, then we need to train those collecting data not to make assumptions, and teach them how to ask a question respectfully if we do expect them to ask it.
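One way to make the "don't assume" rule concrete is to store where each identity value came from and to refuse anything the person didn't report themselves. The following is only a minimal sketch under that assumption. The names `Source`, `IdentityField`, and `record_identity` are hypothetical and don't come from any particular database or CRM.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Source(Enum):
    # Hypothetical provenance labels, for illustration only.
    SELF_REPORTED = "self_reported"  # the person told us directly
    STAFF_GUESS = "staff_guess"      # someone's "best guess"
    INFERRED = "inferred"            # derived from purchases, memberships, etc.


@dataclass
class IdentityField:
    value: Optional[str] = None      # None means "unknown", which is a valid state
    source: Optional[Source] = None


def record_identity(field: IdentityField, value: str, source: Source) -> bool:
    """Store the value only if it was self-reported. Return True if stored."""
    if source is not Source.SELF_REPORTED:
        # Leave the field untouched rather than filling it with a guess.
        return False
    field.value = value
    field.source = source
    return True
```

The design choice here is that "unknown" stays a legitimate answer: a guess is rejected outright instead of being saved with a "BG"-style flag.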
Another example I’ve come across relates to names. When working with large data sets, the duplicate is the enemy of the good, and often a database manager needs to decide whether to merge two potentially duplicate records based on some criteria. Particularly if you’re doing this manually, or with some manual oversight, it can be more tempting to merge records for two people with the exact same name if it’s an uncommon name than if there are two “Joe Smith”s. Based on probability, we’d assume that the uncommon name is more likely to be a duplicate and the common name is more likely to belong to two different people. But what is a “common” name?
When I’m working with white colleagues on data management, I tend to warn about the instinct to merge what seems like an “uncommon name” but is actually just not white. If I’m not familiar with a culture, I’m much more likely to leave two records alone than to merge them, because I’m just not sure how common that name is. I’m sure plenty of Patels and Singhs and Nguyens have been merged by well-meaning white interns who just didn’t take a moment to think about ingrained biases.
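A simple guard against this instinct is to take "how common does this name feel to me?" out of the merge decision entirely. The sketch below is one possible approach, not a standard algorithm: a matching name alone only sends a pair to human review, and a merge requires at least one other matching field. The function name `merge_decision`, the field names (`name`, `email`, `phone`, `birth_date`), and the one-field threshold are all assumptions made for illustration.

```python
def merge_decision(a: dict, b: dict) -> str:
    """Return 'merge', 'review', or 'keep_separate' for two records."""

    def norm(value) -> str:
        # Compare values case-insensitively and ignore stray whitespace.
        return str(value or "").strip().lower()

    if norm(a.get("name")) != norm(b.get("name")):
        return "keep_separate"

    # Hypothetical corroborating fields; adjust to whatever the data set holds.
    corroborating = ("email", "phone", "birth_date")
    matches = sum(
        1
        for key in corroborating
        if norm(a.get(key)) and norm(a.get(key)) == norm(b.get(key))
    )

    # A shared name is never enough by itself, however rare it seems.
    return "merge" if matches >= 1 else "review"
```

Under this rule two "Joe Smith" records and two "Priya Patel" records are treated identically: without a matching email, phone number, or birth date, both pairs go to review rather than being merged on a hunch.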
Interested in this topic? I’ll be presenting Make Your Data Trans-Inclusive with B. Cordelia Yu in just a few weeks at #NTC17 (the Nonprofit Technology Conference put on by NTEN). If you’re planning to attend, be sure to add this Thursday afternoon session to your agenda!