GenderBench: Evaluation Suite for Gender Biases in LLMs
GenderBench targets researchers and practitioners who need to measure gender bias in LLMs, offering a tool for reproducible benchmarking. The contribution is incremental in that it builds on existing bias evaluation efforts.
The authors address the measurement of gender biases in large language models (LLMs) by introducing GenderBench, an evaluation suite of 14 probes covering 19 harmful behaviors. Evaluating 12 LLMs, they find consistent patterns: the models struggle with stereotypical reasoning and equitable gender representation in generated texts, and occasionally show discriminatory behavior in high-stakes scenarios such as hiring.
We present GenderBench, a comprehensive evaluation suite designed to measure gender biases in LLMs. GenderBench includes 14 probes that quantify 19 gender-related harmful behaviors exhibited by LLMs. We release GenderBench as an open-source and extensible library to improve the reproducibility and robustness of benchmarking across the field. We also publish our evaluation of 12 LLMs. Our measurements reveal consistent patterns in their behavior. We show that LLMs struggle with stereotypical reasoning and with equitable gender representation in generated texts, and occasionally also exhibit discriminatory behavior in high-stakes scenarios, such as hiring.