LG
Jun 26, 2022

How should we proxy for race/ethnicity? Comparing Bayesian improved surname geocoding to machine learning methods

arXiv:2206.14583v2
7 citations
h-index: 5
Originality: Incremental advance
AI Analysis

This paper addresses the problem of accurately estimating race/ethnicity in voter data that lack it, reporting incremental gains from machine learning and a need for state-specific validation.

This paper benchmarks Bayesian Improved Surname Geocoding (BISG) against machine learning methods for proxying race/ethnicity in voter files. It finds that machine learning consistently outperforms BISG at individual classification, that the two approaches exhibit divergent biases when estimating regional racial composition, and that the performance of all methods varies substantially across states.

Bayesian Improved Surname Geocoding (BISG) is the most popular method for proxying race/ethnicity in voter registration files that do not contain it. This paper benchmarks BISG against a range of previously untested machine learning alternatives, using voter files with self-reported race/ethnicity from California, Florida, North Carolina, and Georgia. This analysis yields three key findings. First, machine learning consistently outperforms BISG at individual classification of race/ethnicity. Second, BISG and machine learning methods exhibit divergent biases for estimating regional racial composition. Third, the performance of all methods varies substantially across states. These results suggest that pre-trained machine learning models are preferable to BISG for individual classification. Furthermore, mixed results across states underscore the need for researchers to empirically validate their chosen race/ethnicity proxy in their populations of interest.
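
For readers unfamiliar with BISG, the following is a minimal sketch of the Bayes-rule update commonly described as its core: a surname-based prior over race/ethnicity is multiplied by the probability of living in a given geographic unit conditional on race, then normalized. The function name, the category labels, and every probability below are hypothetical illustrations, not code or values from the paper.

```python
from typing import Dict


def bisg_posterior(
    p_race_given_surname: Dict[str, float],
    p_geo_given_race: Dict[str, float],
) -> Dict[str, float]:
    """Combine a surname prior with a geographic likelihood via Bayes' rule.

    posterior(race) is proportional to
        P(race | surname) * P(geography | race)
    """
    unnormalized = {
        race: p_race_given_surname[race] * p_geo_given_race[race]
        for race in p_race_given_surname
    }
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("All categories have zero probability.")
    return {race: value / total for race, value in unnormalized.items()}


# Made-up inputs for one hypothetical voter (NOT real census figures).
surname_prior = {
    "white": 0.05, "black": 0.01, "hispanic": 0.92, "asian": 0.01, "other": 0.01,
}
geo_likelihood = {
    "white": 0.0002, "black": 0.0001, "hispanic": 0.0010, "asian": 0.0003, "other": 0.0002,
}

posterior = bisg_posterior(surname_prior, geo_likelihood)

# Individual classification: assign the most probable category.
predicted = max(posterior, key=posterior.get)
```

Taking the most probable category, as in the last line, corresponds to the individual classification task discussed in the abstract; estimating regional composition would instead aggregate predictions over many voters.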

