CL · Nov 14, 2024

Evaluating Gender Bias in Large Language Models

Michael Döll, Markus Döhring, Andreas Müller

arXiv:2411.09826v1 · 3 citations · h-index: 1

Originality: Synthesis-oriented

AI Analysis

The paper addresses gender bias in AI for communication applications, but it is incremental: it evaluates existing models rather than proposing new solutions.

This study evaluated gender bias in large language models (LLMs) by analyzing pronoun selection and name generation in occupational contexts, finding a positive correlation between model outputs and U.S. labor force gender distributions, with prompting methods having a greater impact than model choice.

Gender bias in artificial intelligence has become an important issue, particularly in the context of language models used in communication-oriented applications. This study examines the extent to which Large Language Models (LLMs) exhibit gender bias in pronoun selection in occupational contexts. The analysis evaluates the models GPT-4, GPT-4o, PaLM 2 Text Bison and Gemini 1.0 Pro using a self-generated dataset. The jobs considered include a range of occupations, from those with a significant male presence to those with a notable female concentration, as well as jobs with a relatively equal gender distribution. Three different sentence processing methods were used to assess potential gender bias: masked tokens, unmasked sentences, and sentence completion. In addition, the LLMs suggested names of individuals in specific occupations, which were then examined for gender distribution. The results show a positive correlation between the models' pronoun choices and the gender distribution present in U.S. labor force data. Female pronouns were more often associated with female-dominated occupations, while male pronouns were more often associated with male-dominated occupations. Sentence completion showed the strongest correlation with actual gender distribution, while name generation resulted in a more balanced 'politically correct' gender distribution, albeit with notable variations in predominantly male or female occupations. Overall, the prompting method had a greater impact on gender distribution than the model selection itself, highlighting the complexity of addressing gender bias in LLMs. The findings highlight the importance of prompting in gender mapping.
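
The correlation analysis described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the helper names `female_share` and `pearson`, the pronoun sets, and every number in the example are hypothetical placeholders, not values from the paper or from U.S. labor force statistics. The sketch assumes that each occupation yields a list of pronouns chosen by a model, and that Pearson's r is an acceptable stand-in for whichever correlation measure the study used.

```python
from collections import Counter
from math import sqrt

# Assumed pronoun sets; the paper's exact pronoun inventory is not given here.
FEMALE = {"she", "her", "hers"}
MALE = {"he", "him", "his"}


def female_share(pronouns):
    """Return the fraction of gendered pronouns that are female.

    Non-gendered answers (e.g. "they") are ignored. Returns None when
    no gendered pronoun is present.
    """
    counts = Counter(p.lower() for p in pronouns)
    female = sum(counts[p] for p in FEMALE)
    male = sum(counts[p] for p in MALE)
    total = female + male
    return female / total if total else None


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Hypothetical illustration only: placeholder labor shares and model outputs.
labor_female_share = {"occupation_a": 0.10, "occupation_b": 0.50, "occupation_c": 0.90}
model_pronouns = {
    "occupation_a": ["he", "he", "he", "she"],
    "occupation_b": ["she", "he", "they", "he", "she"],
    "occupation_c": ["she", "she", "she", "he"],
}

occupations = sorted(labor_female_share)
model_shares = [female_share(model_pronouns[o]) for o in occupations]
labor_shares = [labor_female_share[o] for o in occupations]
r = pearson(labor_shares, model_shares)
print(f"Pearson r between labor share and model pronoun share: {r:.2f}")
```

Running the same computation separately for each prompting method (masked tokens, unmasked sentences, sentence completion) and for each model would allow the kind of comparison the abstract reports, where the method mattered more than the model.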
