Down the Toxicity Rabbit Hole: A Novel Framework to Bias Audit Large Language Models
This work addresses bias auditing in large language models, which matters to developers and users who need to mitigate harmful outputs. The contribution is incremental in that it builds on existing bias-analysis methods.
The paper introduces a framework called 'toxicity rabbit hole' to iteratively elicit toxic content from large language models, auditing bias across 1,266 identity groups in PaLM 2 and other models, with analysis focusing on racism, antisemitism, misogyny, Islamophobia, homophobia, and transphobia.
This paper makes three contributions. First, we present a generalizable, novel framework dubbed \textit{toxicity rabbit hole} that iteratively elicits toxic content from a wide suite of large language models. We first conduct a bias audit of \texttt{PaLM 2} guardrails spanning a set of 1,266 identity groups and present key insights, and then report generalizability across several other models. Second, through the elicited toxic content, we present a broad analysis with a key emphasis on racism, antisemitism, misogyny, Islamophobia, homophobia, and transphobia. Third, driven by concrete examples, we discuss potential ramifications.
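The abstract describes the framework only as iterative elicitation followed by a guardrail audit, so the following is a minimal sketch of how such an audit loop might be organized, not the authors' implementation. Every name here is hypothetical: `generate` stands in for a model call and is assumed to return `None` when a guardrail blocks the request, `build_prompt` is a caller-supplied prompt template, and the per-group "depth" metric (how many consecutive steps the model answers before refusing) is an illustrative choice rather than a measure taken from the paper.

```python
from typing import Callable, Dict, Iterable, Optional

# Hypothetical interfaces (assumptions, not a real model API):
#   generate(prompt)            -> model text, or None if a guardrail blocks it
#   build_prompt(group, text)   -> prompt asking the model to continue from text
Generate = Callable[[str], Optional[str]]
BuildPrompt = Callable[[str, str], str]


def rabbit_hole_depth(
    group: str,
    generate: Generate,
    build_prompt: BuildPrompt,
    seed: str,
    max_steps: int = 10,
) -> int:
    """Count consecutive steps the model answers before its guardrail refuses."""
    previous = seed
    for step in range(max_steps):
        response = generate(build_prompt(group, previous))
        if response is None:
            # Guardrail blocked the request: the chain ends at this step.
            return step
        # Feed the model's own output back in as the next step's input.
        previous = response
    # The model never refused within the step budget.
    return max_steps


def audit(
    groups: Iterable[str],
    generate: Generate,
    build_prompt: BuildPrompt,
    seed: str,
    max_steps: int = 10,
) -> Dict[str, int]:
    """Run the iterative loop once per identity group and collect depths."""
    return {
        g: rabbit_hole_depth(g, generate, build_prompt, seed, max_steps)
        for g in groups
    }
```

Under these assumptions, comparing depths across identity groups gives one rough signal of uneven guardrail coverage: a group whose chain is blocked at step 0 is protected earlier than one whose chain runs to the step budget.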