CL Oct 20, 2024

A Novel Interpretability Metric for Explaining Bias in Language Models: Applications on Multilingual Models from Southeast Asia

arXiv:2410.15464v24.84 citations h-index: 2 Has Code PACLIC

Originality: Incremental advance

AI Analysis

This work addresses the need for interpretability in bias analysis for language models, particularly for underrepresented Southeast Asian languages, though its contribution is incremental: it introduces a new metric rather than a broader solution.

The authors tackled the problem of bias attribution and explainability in pretrained language models by proposing a novel metric, the bias attribution score, based on information theory to measure token-level contributions to biased behavior. They applied this metric to multilingual models from Southeast Asia, confirming the presence of sexist and homophobic bias and identifying specific topics like crime and relationships where bias is strongly reproduced.

Work on bias in pretrained language models (PLMs) focuses on bias evaluation and mitigation and fails to tackle the question of bias attribution and explainability. We propose a novel metric, the $\textit{bias attribution score}$, which draws from information theory to measure token-level contributions to biased behavior in PLMs. We then demonstrate the utility of this metric by applying it to multilingual PLMs, including models from Southeast Asia, which have not yet been thoroughly examined in bias evaluation literature. Our results confirm the presence of sexist and homophobic bias in Southeast Asian PLMs. Interpretability and semantic analyses also reveal that PLM bias is strongly induced by words relating to crime, intimate relationships, and helping, among other discursive categories, suggesting that these are topics where PLMs strongly reproduce bias from pretraining data and where PLMs should be used with more caution.
