Predicting Race and Ethnicity From the Sequence of Characters in a Name
The character-sequence models give researchers a more accurate and generalizable tool for studying racial inequality and fairness in areas such as campaign finance and media coverage.
The paper infers race and ethnicity from names to address the limitations of the Census Bureau's list of popular last names. An LSTM model achieves an out-of-sample accuracy of 0.85 when first names are available, and the best-performing last-name-only model achieves 0.81.
To answer questions about racial inequality and fairness, we often need a way to infer race and ethnicity from names. One way to do so is to rely on the Census Bureau's list of popular last names. The list, however, suffers from at least three limitations: (1) it contains only last names, (2) it includes only popular last names, and (3) it is updated only once every 10 years. To provide better generalization, and higher accuracy when first names are available, we model the relationship between the sequence of characters in a name and race and ethnicity using various techniques. A model using Long Short-Term Memory (LSTM) works best, with an out-of-sample accuracy of 0.85. The best-performing last-name model achieves an out-of-sample accuracy of 0.81. To illustrate the utility of the models, we apply them to campaign finance data to estimate the share of donations made by people of various racial groups, and to news data to estimate the coverage of various races and ethnicities in the news.
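To make "modeling the sequence of characters in a name" concrete, the following is a minimal sketch of one way a name could be turned into a fixed-length integer sequence, the kind of input a sequence model such as an LSTM typically consumes. It is an illustration under stated assumptions, not the paper's pipeline: the function names `build_vocab` and `encode`, the reserved `PAD` and `UNK` indices, the lowercasing, the single-character tokens, the `max_len` value, and the toy names are all hypothetical choices made here.

```python
from typing import Dict, List

# Reserved indices (a hypothetical convention for this sketch):
# 0 pads short names, 1 stands in for characters unseen during training.
PAD, UNK = 0, 1


def build_vocab(names: List[str]) -> Dict[str, int]:
    """Map each distinct lowercase character in the training names to an integer >= 2."""
    chars = sorted({ch for name in names for ch in name.lower()})
    return {ch: i + 2 for i, ch in enumerate(chars)}


def encode(name: str, vocab: Dict[str, int], max_len: int) -> List[int]:
    """Convert a name to a list of exactly max_len character indices."""
    # Look up each character, falling back to UNK, and truncate long names.
    ids = [vocab.get(ch, UNK) for ch in name.lower()][:max_len]
    # Right-pad short names so every sequence has the same length.
    return ids + [PAD] * (max_len - len(ids))


# Toy training names (illustrative only, not taken from the paper's data).
train_names = ["smith john", "garcia maria", "nguyen anh"]
vocab = build_vocab(train_names)
encoded = [encode(n, vocab, max_len=12) for n in train_names]
```

Because the model sees characters rather than whole names, a name that never appeared in the training data still maps to a valid sequence, which is what allows generalization beyond a fixed list of popular last names.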