CL · Sep 17, 2024

Enriching Datasets with Demographics through Large Language Models: What's in a Name?

Khaled AlNuaimi, Gautier Marti, Mathieu Ravaut, Abdulla AlKetbi, Andreas Henschel, Raed Jaradat

arXiv:2409.11491v14.25 citations · h-index: 3

Originality: Incremental advance

AI Analysis

This work addresses a critical need in healthcare, public policy, and social sciences for more precise demographic insights, though the advance is incremental because it builds on existing LLM capabilities.

The paper tackles the problem of enriching datasets with demographic information inferred from names. It demonstrates that zero-shot Large Language Models (LLMs) can perform as well as or better than specialized models, and applies them to real-life data such as a dataset of licensed financial professionals in Hong Kong.

Enriching datasets with demographic information, such as gender, race, and age from names, is a critical task in fields like healthcare, public policy, and social sciences. Such demographic insights allow for more precise and effective engagement with target populations. Despite previous efforts employing hidden Markov models and recurrent neural networks to predict demographics from names, significant limitations persist: the lack of large-scale, well-curated, unbiased, publicly available datasets, and the lack of an approach robust across datasets. This scarcity has hindered the development of traditional supervised learning approaches. In this paper, we demonstrate that the zero-shot capabilities of Large Language Models (LLMs) can perform as well as, if not better than, bespoke models trained on specialized data. We apply these LLMs to a variety of datasets, including a real-life, unlabelled dataset of licensed financial professionals in Hong Kong, and critically assess the inherent demographic biases in these models. Our work not only advances the state-of-the-art in demographic enrichment but also opens avenues for future research in mitigating biases in LLMs.
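
The zero-shot setup described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the prompt wording, the label set, the JSON reply format, and the function names (`build_prompt`, `parse_response`, `enrich`, `stub_llm`) are all assumptions made for this example, and the model is passed in as a plain callable so no particular LLM API is implied.

```python
import json
from typing import Callable, Dict, List

# Hypothetical label set; the categories used in the paper are not listed here.
GENDER_LABELS = ("female", "male", "unknown")


def build_prompt(name: str) -> str:
    """Build a zero-shot prompt: instructions only, no labelled examples."""
    return (
        "Guess the most likely gender for the full name below.\n"
        'Reply with JSON like {"gender": "<label>"}, where <label> is one of: '
        + ", ".join(GENDER_LABELS)
        + ".\n"
        + f"Name: {name}"
    )


def parse_response(text: str) -> str:
    """Extract the label from the model reply; fall back to 'unknown'."""
    try:
        label = str(json.loads(text).get("gender", "")).strip().lower()
    except (json.JSONDecodeError, AttributeError):
        # Reply was not valid JSON, or not a JSON object.
        return "unknown"
    return label if label in GENDER_LABELS else "unknown"


def enrich(records: List[Dict[str, str]], llm: Callable[[str], str]) -> List[Dict[str, str]]:
    """Return copies of the records with a predicted 'gender' field added."""
    return [
        {**record, "gender": parse_response(llm(build_prompt(record["name"])))}
        for record in records
    ]


def stub_llm(prompt: str) -> str:
    """Stand-in for a real model call (assumption); always abstains."""
    return '{"gender": "unknown"}'


if __name__ == "__main__":
    print(enrich([{"name": "Ada Lovelace"}], stub_llm))
```

Replacing `stub_llm` with a function that calls an actual model is the only change needed to run this against real data. Falling back to "unknown" on malformed replies keeps the enrichment step from failing on a single bad response.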
