AI · CL · Feb 18

IndicJR: A Judge-Free Benchmark of Jailbreak Robustness in South Asian Languages

arXiv:2602.16832v1 · 1 citation · h-index: 7
Originality: Incremental advance
AI Analysis

This work addresses safety risks for South Asian users who code-switch and romanize by exposing multilingual vulnerabilities that English-only evaluations hide. It is incremental, however, in that it extends existing jailbreak benchmarking to new languages.

The paper tackles the understudied vulnerability of large language models to jailbreak attacks in South Asian languages by introducing the Indic Jailbreak Robustness (IJR) benchmark. It reports that LLaMA and Sarvam exceed a jailbreak success rate of 0.92 in the contract-bound (JSON) track, that all models reach 1.0 in the naturalistic (Free) track, that attacks transfer strongly from English, and that orthography affects outcomes.

Safety alignment of large language models (LLMs) is mostly evaluated in English and in contract-bound settings, leaving multilingual vulnerabilities understudied. We introduce Indic Jailbreak Robustness (IJR), a judge-free benchmark for adversarial safety across 12 Indic and South Asian languages (2.1 billion speakers), covering 45,216 prompts in JSON (contract-bound) and Free (naturalistic) tracks. IJR reveals three patterns. (1) Contracts inflate refusals but do not stop jailbreaks: in JSON, LLaMA and Sarvam exceed a jailbreak success rate (JSR) of 0.92, and in Free all models reach 1.0 with refusals collapsing. (2) English-to-Indic attacks transfer strongly, with format wrappers often outperforming instruction wrappers. (3) Orthography matters: romanized or mixed inputs reduce JSR under JSON, with correlations to romanization share and tokenization (approximately 0.28 to 0.32) indicating systematic effects. Human audits confirm detector reliability, and lite-to-full comparisons preserve conclusions. IJR offers a reproducible multilingual stress test revealing risks hidden by English-only, contract-focused evaluations, especially for South Asian users who frequently code-switch and romanize.
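
To make the quantities in the abstract concrete, the following is a minimal sketch of how a jailbreak success rate and a correlation of the kind mentioned under pattern (3) could be computed. It is not the authors' implementation: the function names, the boolean per-prompt outcome format, and the choice of the Pearson coefficient are all assumptions made for illustration, and the abstract does not say how the judge-free detector labels a response or which correlation measure was used.

```python
from math import sqrt


def jailbreak_success_rate(outcomes):
    """Return the fraction of attack prompts marked as successful.

    `outcomes` is a hypothetical list of booleans, one per prompt,
    where True means the (assumed) detector flagged a jailbreak.
    """
    if not outcomes:
        raise ValueError("need at least one outcome")
    return sum(1 for flagged in outcomes if flagged) / len(outcomes)


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Toy, made-up outcomes for one language; NOT data from the paper.
toy_outcomes = [True, False, True, True]
toy_jsr = jailbreak_success_rate(toy_outcomes)
```

In such a setup, one JSR value would be computed per language (or per prompt group) and passed to `pearson` alongside the matching romanization shares; the values used here are placeholders only.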
