CL · Dec 12, 2023 · Code
SocialStigmaQA: A Benchmark to Uncover Stigma Amplification in Generative Language Models
Manish Nagireddy, Lamogha Chiazor, Moninder Singh et al.
Current datasets for unwanted social bias auditing are limited to studying protected demographic features such as race and gender. In this work, we introduce a comprehensive benchmark that is meant to capture the amplification of social bias, via stigmas, in generative language models. Taking inspiration from social science research, we start with a documented list of 93 US-centric stigmas and curate a question-answering (QA) dataset which involves simple social situations. Our benchmark, SocialStigmaQA, contains roughly 10K prompts, with a variety of prompt styles, carefully constructed to systematically test for both social bias and model robustness. We present results for SocialStigmaQA with two open-source generative language models and find that the proportion of socially biased output ranges from 45% to 59% across a variety of decoding strategies and prompting styles. We demonstrate that the deliberate design of the templates in our benchmark (e.g., adding biasing text to the prompt or using different verbs that change the answer that indicates bias) affects the models' tendency to generate socially biased output. Additionally, through manual evaluation, we discover problematic patterns in the generated chain-of-thought output that range from subtle bias to lack of reasoning. Warning: This paper contains examples of text which are toxic, biased, and potentially harmful.
LG · Mar 9, 2024
Detectors for Safe and Reliable LLMs: Implementations, Uses, and Limitations
Swapnaja Achintalwar, Adriana Alvarado Garcia, Ateret Anaby-Tavor et al. · ibm-research
Large language models (LLMs) are susceptible to a variety of risks, from non-faithful output to biased and toxic generations. Due to several limiting factors surrounding LLMs (training cost, API access, data availability, etc.), it may not always be feasible to impose direct safety constraints on a deployed model. Therefore, an efficient and reliable alternative is required. To this end, we present our ongoing efforts to create and deploy a library of detectors: compact and easy-to-build classification models that provide labels for various harms. In addition to the detectors themselves, we discuss a wide range of uses for these detector models, from acting as guardrails to enabling effective AI governance. We also examine in depth the inherent challenges in their development and discuss future work aimed at making the detectors more reliable and broadening their scope.
CL · Jul 8, 2024
Epistemological Bias As a Means for the Automated Detection of Injustices in Text
Kenya Andrews, Lamogha Chiazor
Injustices in text are often subtle, since implicit biases and stereotypes frequently operate unconsciously due to the pervasive nature of prejudice in society. This makes automated detection of injustices more challenging, so they are often overlooked. To address these complexities and offer explainability, we introduce a novel framework that draws on knowledge from epistemology to enhance the detection of implicit injustices in text using NLP models. Our empirical study shows how our framework can be applied to effectively detect these injustices. We validate our framework using a human baseline study, which mostly agrees with the framework's choice of implicit bias, stereotype, and sentiment. The main feedback from the study concerned the extended time required to analyze, digest, and decide on each component of our framework. This highlights the importance of our automated framework pipeline, which assists users in detecting implicit injustices while offering explainability and reducing time burdens on humans.
CL · May 20, 2025
DECASTE: Unveiling Caste Stereotypes in Large Language Models through Multi-Dimensional Bias Analysis
Prashanth Vijayaraghavan, Soroush Vosoughi, Lamogha Chiazor et al.
Recent advancements in large language models (LLMs) have revolutionized natural language processing (NLP) and expanded their applications across diverse domains. However, despite their impressive capabilities, LLMs have been shown to reflect and perpetuate harmful societal biases, including those based on ethnicity, gender, and religion. A critical and underexplored issue is the reinforcement of caste-based biases, particularly towards India's marginalized caste groups such as Dalits and Shudras. In this paper, we address this gap by proposing DECASTE, a novel, multi-dimensional framework designed to detect and assess both implicit and explicit caste biases in LLMs. Our approach evaluates caste fairness across four dimensions: socio-cultural, economic, educational, and political, using a range of customized prompting strategies. By benchmarking several state-of-the-art LLMs, we reveal that these models systematically reinforce caste biases, with significant disparities observed in the treatment of oppressed versus dominant caste groups. For example, bias scores are notably elevated when comparing Dalits and Shudras with dominant caste groups, reflecting societal prejudices that persist in model outputs. These results expose the subtle yet pervasive caste biases in LLMs and emphasize the need for more comprehensive and inclusive bias evaluation methodologies that assess the potential risks of deploying such models in real-world contexts.