Why Do Safety Guardrails Degrade Across Languages?

Max Zhang, Ameen Patel, Sang T. Truong, Sanmi Koyejo

arXiv:2605.17173

Predicted impact: top 70% in CL · last 90 days

Originality: Incremental advance

AI Analysis

For researchers and practitioners evaluating cross-lingual LLM safety, this work provides a method to disentangle confounding factors and identify specific vulnerabilities that aggregate metrics obscure.

The paper introduces a Multi-Group Item Response Theory framework that decouples the factors driving safety degradation in LLMs across languages. It shows that safety failures are not confined to low-resource languages, and the framework achieves AUC = 0.940 in predicting safe refusal.

Large language models exhibit safety degradation in non-English languages. Standard evaluation relies on Jailbreak Success Rate (JSR), which confounds several safety-driving factors into one, obscuring the specific cause(s) of safety failure. We introduce a latent variable model, a Multi-Group Item Response Theory (IRT) framework, that decouples safety-driving factors such as language-agnostic safety robustness ($θ$), intrinsic prompt hardness ($β$), global language processing difficulty ($γ$), and a prompt-specific cross-lingual safety gap ($τ$). Using the MultiJail dataset, we evaluate the safety robustness of 61 model configurations across 5 closed-model families and 10 languages of varying resource, aggregating a dataset of 1.9 million rows. Exploratory Factor Analysis shows safety is primarily unidimensional: models refuse different harm types mainly through a shared mechanism. Contrary to the expected trend that safety degrades largely in low-resource languages, 22 model configurations are more vulnerable in English than in low-resource languages. Low-resource languages produce more uncertain responses (high entropy) than high-resource languages. Also, high-$τ$ prompts cluster in physical harm categories like Theft and Weapons and lower-resource languages, trends validated through cross-dataset generalization. While global translation quality shows low correlation with $τ$, severe mistranslations drive high-bias outliers, as validated by native speakers. Cultural and conceptual grounding mismatches also contribute to $τ$. In predictive validation, the IRT framework achieves $\mathrm{AUC} = 0.940$, outperforming simpler baselines in predicting safe refusal of unsafe prompts. Our framework reveals concept-language vulnerabilities that aggregate metrics obscure, enabling fairer cross-lingual safety evaluation and targeted improvements in dataset construction.
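
As a rough illustration of how the four factors named in the abstract could combine, the sketch below assumes a logistic (Rasch-style) link in which the log-odds of a safe refusal are $θ - β - γ - τ$. The abstract does not state the functional form, so the additive structure, the signs, the function name `p_safe_refusal`, and every numeric value here are assumptions chosen only for illustration.

```python
import math


def p_safe_refusal(theta: float, beta: float, gamma: float, tau: float) -> float:
    """Probability of a safe refusal under an ASSUMED additive logistic link.

    theta: language-agnostic safety robustness of the model (higher = safer)
    beta:  intrinsic hardness of the prompt
    gamma: global processing difficulty of the language
    tau:   prompt-specific cross-lingual safety gap
    """
    logit = theta - beta - gamma - tau
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical scenario A: a weaker model on an easy language, no prompt-specific gap.
p_a = p_safe_refusal(theta=1.0, beta=0.5, gamma=0.0, tau=0.0)

# Hypothetical scenario B: a stronger model, harder language, large prompt-specific gap.
p_b = p_safe_refusal(theta=2.0, beta=0.5, gamma=0.5, tau=0.5)

# Both scenarios have the same log-odds (0.5), so their expected jailbreak
# success rates (1 - p) are identical even though the causes differ.
jsr_a = 1.0 - p_a
jsr_b = 1.0 - p_b

# Holding everything else fixed, a larger tau lowers the refusal probability.
p_low_tau = p_safe_refusal(theta=2.0, beta=0.5, gamma=0.5, tau=0.0)
p_high_tau = p_safe_refusal(theta=2.0, beta=0.5, gamma=0.5, tau=1.5)

print(f"JSR A = {jsr_a:.3f}, JSR B = {jsr_b:.3f}")
print(f"P(refuse) low tau = {p_low_tau:.3f}, high tau = {p_high_tau:.3f}")
```

Under these assumed values, scenarios A and B are indistinguishable by JSR alone, which is the kind of confounding the abstract attributes to aggregate metrics. Separating the parameters requires fitting the model jointly over many models, prompts, and languages.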
