Race and ethnicity data for first, middle, and last names
The dataset gives researchers and analysts a resource for imputing race and ethnicity in data where such information is unavailable, addressing a need in social science and public policy. The contribution is incremental, since it builds on existing methods such as BISG.
The authors address the problem of imputing race and ethnicity by compiling the largest publicly available dictionaries of first, middle, and last names, drawn from the voter files of six Southern states. The resulting datasets cover roughly one million first names, 1.1 million middle names, and 1.4 million surnames, with individuals categorized into five racial and ethnic groups.
We provide the largest compiled publicly available dictionaries of first, middle, and last names for the purpose of imputing race and ethnicity using, for example, Bayesian Improved Surname Geocoding (BISG). The dictionaries are based on the voter files of six Southern states that collect self-reported racial data upon voter registration. Our data cover a much larger scope of names than any comparable dataset, containing roughly one million first names, 1.1 million middle names, and 1.4 million surnames. Individuals are categorized into five mutually exclusive racial and ethnic groups -- White, Black, Hispanic, Asian, and Other -- and racial/ethnic counts by name are provided for every name in each dictionary. Counts can then be normalized row-wise or column-wise to obtain conditional probabilities of race given name or name given race, respectively. These conditional probabilities can then be deployed for imputation in a data analytic task for which ground-truth racial and ethnic data are not available.
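The two normalizations described above can be illustrated with a minimal sketch. The names `NAME_A` and `NAME_B`, the counts, and the in-memory layout are all hypothetical and chosen only for illustration; they are not taken from the released dictionaries, whose actual file format and column names may differ.

```python
# Toy counts per name, ordered as in GROUPS. All values are made up.
GROUPS = ["White", "Black", "Hispanic", "Asian", "Other"]

counts = {
    "NAME_A": [80, 10, 5, 3, 2],
    "NAME_B": [20, 60, 10, 5, 5],
}


def race_given_name(counts):
    """Row-wise normalization: each row sums to 1, giving P(race | name)."""
    probs = {}
    for name, row in counts.items():
        total = sum(row)
        # Guard against an all-zero row to avoid division by zero.
        probs[name] = [c / total if total else 0.0 for c in row]
    return probs


def name_given_race(counts):
    """Column-wise normalization: each column sums to 1, giving P(name | race)."""
    # zip(*rows) transposes the table so each tuple is one racial/ethnic column.
    col_totals = [sum(col) for col in zip(*counts.values())]
    return {
        name: [c / t if t else 0.0 for c, t in zip(row, col_totals)]
        for name, row in counts.items()
    }


p_race_given_name = race_given_name(counts)
p_name_given_race = name_given_race(counts)

for name in counts:
    print(name, dict(zip(GROUPS, p_race_given_name[name])))
```

In a BISG-style workflow, probabilities of this kind would typically be combined with geographic information through Bayes' rule; that step is not shown here.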