Instate: Predicting the State of Residence From Last Name
The work helps service providers such as survey statisticians and call centers improve localization for users across India, though the contribution is incremental because it applies an existing method to new data.
The paper tackles the challenge of serving India's diverse language communities by predicting a user's state of residence from their last name, achieving a top-3 accuracy of 85.3% on unseen names.
India has twenty-two official languages. Serving such a linguistically diverse population is a challenge for survey statisticians, call center operators, software developers, and other service providers. To support better localization for different language communities, we introduce a new machine learning method that infers the language(s) a user is likely to understand from their last name. Using nearly 438M records spanning 33 Indian states and 1.13M unique last names from the Indian Electoral Rolls Corpus (?), we build a character-level, transformer-based model that predicts the state of residence from the last name. The model has a top-3 accuracy of 85.3% on unseen names. We then map the predicted states to languages using the Indian census to infer the languages the respondent is likely to understand. We provide open-source software that implements the method discussed in the paper.
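To make the evaluation metric and the state-to-language step concrete, the following is a minimal sketch in Python, not the authors' released software. It assumes the model's output for a last name is available as a dictionary of per-state probabilities. The function names (`top_k_states`, `top_k_accuracy`, `languages_for`) and the small `STATE_LANGUAGES` table are hypothetical and purely illustrative; the paper derives its mapping from the Indian census, and its exact definition of top-3 accuracy for names that occur in several states may differ from the single-label version shown here.

```python
from typing import Dict, List, Sequence

# Hypothetical, hand-written lookup for illustration only.
# The paper builds its state-to-language mapping from the Indian census.
STATE_LANGUAGES: Dict[str, List[str]] = {
    "Kerala": ["Malayalam"],
    "Tamil Nadu": ["Tamil"],
    "Punjab": ["Punjabi"],
    "Maharashtra": ["Marathi"],
}


def top_k_states(probs: Dict[str, float], k: int = 3) -> List[str]:
    """Return the k states with the highest predicted probability."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    return [state for state, _ in ranked[:k]]


def top_k_accuracy(
    predictions: Sequence[Dict[str, float]],
    true_states: Sequence[str],
    k: int = 3,
) -> float:
    """Fraction of names whose true state appears among the top-k predictions.

    Assumes one true state per name, which is a simplification.
    """
    if not true_states:
        raise ValueError("true_states must not be empty")
    hits = sum(
        truth in top_k_states(probs, k)
        for probs, truth in zip(predictions, true_states)
    )
    return hits / len(true_states)


def languages_for(states: Sequence[str]) -> List[str]:
    """Map predicted states to languages, keeping order and dropping duplicates."""
    languages: List[str] = []
    for state in states:
        for language in STATE_LANGUAGES.get(state, []):
            if language not in languages:
                languages.append(language)
    return languages


# Made-up probabilities standing in for the model output for one last name.
example_probs = {
    "Kerala": 0.50,
    "Tamil Nadu": 0.30,
    "Punjab": 0.15,
    "Maharashtra": 0.05,
}
predicted = top_k_states(example_probs, k=3)
print(predicted, languages_for(predicted))
```

Under this reading, a prediction counts as correct whenever the true state is anywhere in the three highest-ranked states, which is why top-3 accuracy suits a setting where several candidate languages can be offered to the user.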