* * *
Shortly after the country’s founding, the U.S. government began collecting data on the race and ethnicity of every person in each household. Every decennial census ushers in some new language meant to enhance the accuracy and reliability of the count as a measurement of the entire national population. There’s symbolic power in being represented on the census, in being counted. But as the political scientist Melissa Nobles shows in her book Shades of Citizenship, these data are also used to track compliance with civil-rights legislation, particularly in the drawing of voting districts. They are linked to federal resources as well, intensifying public agitation around the categories.
In the years between censuses, researchers, activists, politicians, and interest groups lobby for the rewording of a label, the addition (or elimination) of a category, or the disaggregation of another, such as Asian or American Indian or Alaska Native. In 2000, for example, “Hispanic or Latino, or Spanish origins” was reclassified from racial to ethnic data. That year, respondents were also allowed for the first time to select multiple boxes to reflect multiracial heritage. Additional changes that affect how the racial makeup of the country is represented are underway, including the creation of a separate category for people of Middle Eastern and North African descent (referred to as MENA).
Shifts in racial classifications raise questions about what exactly is being counted, how people interpret the same questions differently, and what to do about people’s changing perceptions of their racial background. In 2015, the Pew Research Center noted that at least 9.8 million people had reported a different racial or ethnic background than they did in 2000. When someone appears to “change” races, the resulting data are sometimes construed as erroneous.
The statistical accounting used to correct such errors is commonly referred to as “data cleaning” or “data cleansing.” This process involves identifying and then editing data already collected (through modification, enhancement, or deletion of responses) when those data do not conform to predetermined rules that standardize the data set. Ostensibly, the goal is to improve data quality by correcting measurement errors generated by the people who complete the questionnaires or enter responses into the database. Data cleaning also aims to make a final data set comparable to other, related ones, such as other national censuses and the American Community Survey.
Errors in reporting and recording certainly do happen. But if racial data must be cleaned, then some data are dirty. And that dirtiness is undeniably political. Some responses are more likely than others to be diagnosed as dirty. Given the goal of creating information that is comparable from one national census to the next, the data most under suspicion are those that correspond to the categories most in flux: people who checked more than one box, for example, or those who saw themselves as members of different racial or ethnic groups at different times.