CL IR · Mar 14, 2024

Evaluating LLMs for Gender Disparities in Notable Persons

Lauren Rhue, Sofie Goethals, Arun Sundararajan

arXiv:2403.09148v1 · 3.45 citations

Originality Synthesis-oriented

AI Analysis

The paper addresses fairness concerns about using LLMs for factual retrieval and highlights persistent gender disparities, though its contribution is incremental in that it evaluates existing models.

This study investigated gender-based biases in Large Language Models (GPT-3.5 and GPT-4) when retrieving factual information and found discernible gender disparities in their responses. GPT-4 shows improvements but does not fully eradicate the disparities, notably in declined responses.

This study examines the use of Large Language Models (LLMs) for retrieving factual information, addressing concerns over their propensity to produce factually incorrect "hallucinated" responses or to decline to answer a prompt at all. Specifically, it investigates the presence of gender-based biases in LLMs' responses to factual inquiries. The paper takes a multi-pronged approach, evaluating the fairness of GPT models across multiple dimensions: recall, hallucinations, and declinations. The findings reveal discernible gender disparities in the responses generated by GPT-3.5. While advancements in GPT-4 have led to improvements in performance, they have not fully eradicated these gender disparities, notably in instances where responses are declined. The study further explores the origins of these disparities by examining the influence of gender associations in prompts and the homogeneity in the responses.
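As a rough illustration of what comparing recall, hallucinations, and declinations across genders could look like, the Python sketch below tallies per-gender outcome rates and the gap between two groups. The outcome labels, the function names `rates_by_gender` and `disparity`, and the toy records are all assumptions made for illustration; the abstract does not specify the paper's actual metrics, data, or coding scheme.

```python
from collections import defaultdict

# Hypothetical outcome labels for a model response (an assumption, not the
# paper's coding scheme): the fact was recalled correctly, an incorrect
# answer was given, or the model declined to answer.
OUTCOMES = ("recalled", "hallucinated", "declined")


def rates_by_gender(records):
    """Return {gender: {outcome: share of that gender's responses}}.

    `records` is an iterable of (gender, outcome) pairs.
    """
    counts = defaultdict(lambda: dict.fromkeys(OUTCOMES, 0))
    for gender, outcome in records:
        if outcome not in OUTCOMES:
            raise ValueError(f"unknown outcome: {outcome!r}")
        counts[gender][outcome] += 1

    rates = {}
    for gender, per_outcome in counts.items():
        total = sum(per_outcome.values())
        rates[gender] = {o: per_outcome[o] / total for o in OUTCOMES}
    return rates


def disparity(rates, outcome, group_a, group_b):
    """Difference in the rate of `outcome` between two groups (a minus b)."""
    return rates[group_a][outcome] - rates[group_b][outcome]


# Made-up toy records, purely to exercise the functions; not results from the paper.
toy_records = [
    ("female", "recalled"), ("female", "recalled"),
    ("female", "hallucinated"), ("female", "declined"),
    ("male", "recalled"), ("male", "recalled"),
    ("male", "recalled"), ("male", "hallucinated"),
]

toy_rates = rates_by_gender(toy_records)
declination_gap = disparity(toy_rates, "declined", "female", "male")
```

A simple rate difference is only one possible disparity measure; a ratio or a significance test would be equally reasonable choices under the same assumptions.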
