CL · Oct 25, 2023

Muslim-Violence Bias Persists in Debiased GPT Models

Babak Hemmatian, Razan Baltaji, Lav R. Varshney

arXiv:2310.18368v2 · 2.57 citations · h-index: 34

Originality Incremental advance

AI Analysis

The finding points to a persistent and potentially worsening bias in AI language models, which poses risks for fairness and safety in applications. The contribution is incremental because it builds on prior bias-detection work.

The study investigated whether debiased GPT models still exhibit anti-Muslim bias in generated completions, finding that using common names in prompts significantly increases violent completions for Muslims, with ChatGPT showing stronger bias than InstructGPT.

Abid et al. (2021) showed a tendency in GPT-3 to generate mostly violent completions when prompted about Muslims, compared with other religions. Two pre-registered replication attempts found few violent completions and only a weak anti-Muslim bias in the more recent InstructGPT, fine-tuned to eliminate biased and toxic outputs. However, more pre-registered experiments showed that using common names associated with the religions in prompts increases several-fold the rate of violent completions, revealing a significant second-order anti-Muslim bias. ChatGPT showed a bias many times stronger regardless of prompt format, suggesting that the effects of debiasing were reduced with continued model development. Our content analysis revealed religion-specific themes containing offensive stereotypes across all experiments. Our results show the need for continual de-biasing of models in ways that address both explicit and higher-order associations.
