LG, Jun 26, 2022
How should we proxy for race/ethnicity? Comparing Bayesian improved surname geocoding to machine learning methods
Ari Decter-Frain
Bayesian Improved Surname Geocoding (BISG) is the most popular method for proxying race/ethnicity in voter registration files that do not contain it. This paper benchmarks BISG against a range of previously untested machine learning alternatives, using voter files with self-reported race/ethnicity from California, Florida, North Carolina, and Georgia. This analysis yields three key findings. First, machine learning consistently outperforms BISG at individual classification of race/ethnicity. Second, BISG and machine learning methods exhibit divergent biases for estimating regional racial composition. Third, the performance of all methods varies substantially across states. These results suggest that pre-trained machine learning models are preferable to BISG for individual classification. Furthermore, mixed results across states underscore the need for researchers to empirically validate their chosen race/ethnicity proxy in their populations of interest.
SI, Jan 17, 2022
Millions of Co-purchases and Reviews Reveal the Spread of Polarization and Lifestyle Politics across Online Markets
Alexander Ruch, Ari Decter-Frain, Raghav Batra
Polarization in America has reached a high point as markets are also becoming polarized. Existing research, however, focuses on specific market segments and products and has not evaluated this trend's full breadth. If such fault lines do spread into other segments that are not explicitly political, it would indicate the presence of lifestyle politics -- when ideas and behaviors not inherently political become politically aligned through their connections with explicitly political things. We study the pervasiveness of polarization and lifestyle politics over different product segments in a diverse market and test the extent to which consumer- and platform-level network effects and morality may explain lifestyle politics. Specifically, using graph and language data from Amazon (82.5M reviews of 9.5M products and product and category metadata from 1996-2014), we sample 234.6 million relations among 21.8 million market entities to find product categories that are most politically relevant, aligned, and polarized. We then extract moral values present in reviews' text and use these data and other reviewer-, product-, and category-level data to test whether individual- and platform-level network factors explain lifestyle politics better than products' implicit morality. We find pervasive lifestyle politics. Cultural products are 4 times more polarized than any other segment, products' political attributes have up to 3.7 times larger associations with lifestyle politics than author-level covariates, and morality has statistically significant but relatively small correlations with lifestyle politics. Examining lifestyle politics in these contexts helps us better understand the extent and root of partisan differences, why Americans may be so polarized, and how this polarization affects market systems.