raceBERT -- A Transformer-based Model for Predicting Race and Ethnicity from Names
This work addresses the need for accurate demographic prediction from names, which is useful for social science and fairness research. Its contribution is incremental: it builds on prior methods by replacing LSTMs with transformers.
The paper tackles the problem of predicting race and ethnicity from names with raceBERT, a transformer-based model that achieves state-of-the-art results: an average F1-score of 0.86, a 4.1% improvement over the previous state of the art, and improvements of 15-17% for non-white names.
This paper presents raceBERT -- a transformer-based model for predicting race and ethnicity from character sequences in names, and an accompanying Python package. Trained on a U.S. Florida voter registration dataset, the model predicts the likelihood of a name belonging to each of five U.S. census race categories (White, Black, Hispanic, Asian & Pacific Islander, American Indian & Alaskan Native). I build on Sood and Laohaprapanon (2018) by replacing their LSTM model with transformer-based models (a pre-trained BERT model and a RoBERTa model trained from scratch), and compare the results. To the best of my knowledge, raceBERT achieves state-of-the-art results in race prediction using names, with an average F1-score of 0.86 -- a 4.1% improvement over the previous state of the art, and improvements of 15-17% for non-white names.
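To make the headline metric concrete, the following is a minimal sketch of how a per-category F1-score and its average over the five categories could be computed. It is not taken from the raceBERT package: the label strings, the function names, and the choice of an unweighted (macro) average are all assumptions made for illustration, since the abstract does not state which averaging scheme is used.

```python
from collections import Counter

# Hypothetical short labels for the five census categories named in the abstract.
CATEGORIES = ["white", "black", "hispanic", "api", "aian"]


def per_class_f1(y_true, y_pred, labels):
    """Return a dict mapping each label to its F1-score."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted this label, but it was wrong
            fn[truth] += 1  # missed the true label
    scores = {}
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the label never occurs.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 (an assumed averaging scheme)."""
    scores = per_class_f1(y_true, y_pred, labels)
    return sum(scores.values()) / len(labels)


# Toy example with made-up labels, not data from the paper.
y_true = ["white", "white", "black", "hispanic"]
y_pred = ["white", "black", "black", "hispanic"]
scores = per_class_f1(y_true, y_pred, CATEGORIES)
average = macro_f1(y_true, y_pred, CATEGORIES)
```

Reporting per-category scores alongside the average matters here because a single average can hide large gaps between categories, which is the kind of difference the 15-17% figure for non-white names describes.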