CV | Mar 13, 2023 | Code
Instate: Predicting the State of Residence From Last Name
Atul Dhingra, Gaurav Sood
India has twenty-two official languages. Serving such a diverse language base is a challenge for survey statisticians, call center operators, software developers, and other such service providers. To help provide better services to different language communities via better localization, we introduce a new machine learning model that predicts the language(s) that the user can speak from their name. Using nearly 438M records spanning 33 Indian states and 1.13M unique last names from the Indian Electoral Rolls Corpus, we build a character-level transformer-based machine-learning model that predicts the state of residence based on the last name. The model has a top-3 accuracy of 85.3% on unseen names. We map the states to languages using the Indian census to infer languages understood by the respondent. We provide open-source software that implements the method discussed in the paper.
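For readers unfamiliar with the metric, top-3 accuracy counts a prediction as correct when the true state appears among the three highest-scoring states. The sketch below is illustrative only: the function name, the dictionary-of-scores input format, and the toy scores are assumptions, not the paper's code or data.

```python
def top_k_accuracy(scores, labels, k=3):
    """Share of examples whose true label is among the k highest-scoring classes.

    scores: list of dicts mapping a class name (here, a state) to a model score.
    labels: list of true class names, aligned with `scores`.
    """
    hits = 0
    for row, label in zip(scores, labels):
        # Class names ordered from highest to lowest score, truncated to k.
        top_k = sorted(row, key=row.get, reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)


# Hypothetical scores for two last names (made-up numbers, for illustration).
toy_scores = [
    {"Punjab": 0.50, "Haryana": 0.30, "Delhi": 0.15, "Kerala": 0.05},
    {"Punjab": 0.10, "Haryana": 0.20, "Delhi": 0.30, "Kerala": 0.40},
]
toy_labels = ["Delhi", "Punjab"]

print(top_k_accuracy(toy_scores, toy_labels, k=3))
```

In the toy data, the first name's true state is third-ranked and counts as a hit, while the second name's true state is ranked last and counts as a miss.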
LG | Apr 20, 2023
Scaling ML Products At Startups: A Practitioner's Guide
Atul Dhingra, Gaurav Sood
How do you scale a machine learning product at a startup? In particular, how do you serve a greater volume, velocity, and variety of queries cost-effectively? We break down costs into variable costs (the cost of serving the model and keeping it performant) and fixed costs (the cost of developing and training new models). We propose a framework for conceptualizing these costs, break them into finer categories, and outline ways to reduce them. Lastly, since in our experience the most expensive fixed cost of a machine learning system is identifying the root causes of failures and driving continuous improvement, we present a way to conceptualize these issues and share our methodology for addressing them.
CY | Mar 9 | Code
Social Proof is in the Pudding: The (Non)-Impact of Social Proof on Software Downloads
Lucas Shen, Gaurav Sood
Open-source software is widely used in commercial applications. Pair that with the fact that when choosing open-source software for a new problem, developers often use social proof as a cue. These two facts raise concerns that bad actors can game social proof metrics to induce the use of malign software. We study the question using two field experiments. On the largest developer platform, GitHub, we buy 'stars' for a random set of GitHub repositories of new Python packages and estimate their impact on package downloads and broader repository activity. We find no discernible impact on downloads, nor on forks, pull requests, issues, or other measures of developer engagement. In another field experiment, we manipulate the number of human downloads for Python packages. Again, we find no detectable effect on subsequent downloads or on any measure of repository activity.
AP | May 5, 2018
Predicting Race and Ethnicity From the Sequence of Characters in a Name
Rajashekar Chintalapati, Suriyan Laohaprapanon, Gaurav Sood
To answer questions about racial inequality and fairness, we often need a way to infer race and ethnicity from names. One way to infer race and ethnicity from names is by relying on the Census Bureau's list of popular last names. The list, however, suffers from at least three limitations: 1. it only contains last names, 2. it only includes popular last names, and 3. it is updated once every 10 years. To provide better generalization, and higher accuracy when first names are available, we model the relationship between characters in a name and race and ethnicity using various techniques. A model using Long Short-Term Memory works best with out-of-sample accuracy of .85. The best-performing last-name model achieves out-of-sample accuracy of .81. To illustrate the utility of the models, we apply them to campaign finance data to estimate the share of donations made by people of various racial groups, and to news data to estimate the coverage of various races and ethnicities in the news.
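To make "modeling the characters in a name" concrete, the sketch below shows one common way to turn a name into a fixed-length sequence of integer ids that a sequence model such as an LSTM could consume. The vocabulary, the padding and unknown-character ids, and the maximum length are all assumptions chosen for illustration; the abstract does not specify the authors' preprocessing.

```python
import string

# Hypothetical vocabulary: lowercase letters plus a few characters common in names.
# Id 0 is reserved for padding; unseen characters map to a shared UNK id.
VOCAB = {ch: i + 1 for i, ch in enumerate(string.ascii_lowercase + " '-")}
UNK = len(VOCAB) + 1


def encode_name(name, max_len=20):
    """Map a name to a list of character ids, truncated or zero-padded to max_len."""
    ids = [VOCAB.get(ch, UNK) for ch in name.lower()[:max_len]]
    return ids + [0] * (max_len - len(ids))


print(encode_name("Sood", max_len=6))
```

Because the model sees characters rather than whole names, it can score names that never appear in a lookup list, which is the generalization benefit the abstract describes.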