CL · AI · Oct 26, 2025

Cross-Lingual Stability and Bias in Instruction-Tuned Language Models for Humanitarian NLP

Poli Nemkova, Amrit Adhikari, Matthew Pearson, Vamsi Krishna Sadu, Mark V. Albert

arXiv:2510.22823v1 · 2 citations · h-index: 2

Originality: Incremental advance

AI Analysis

The paper addresses the cost-reliability trade-off faced by humanitarian organizations that need multilingual human rights monitoring, and it provides practical guidance for deployment.

This paper systematically compares commercial and open-weight large language models for human-rights-violation detection across seven languages, finding that aligned models maintain near-invariant accuracy and balanced calibration across languages, while open-weight models show significant prompt-language sensitivity and calibration drift.

Humanitarian organizations face a critical choice: invest in costly commercial APIs or rely on free open-weight models for multilingual human rights monitoring. While commercial systems offer reliability, open-weight alternatives lack empirical validation -- especially for low-resource languages common in conflict zones. This paper presents the first systematic comparison of commercial and open-weight large language models (LLMs) for human-rights-violation detection across seven languages, quantifying the cost-reliability trade-off facing resource-constrained organizations. Across 78,000 multilingual inferences, we evaluate six models -- four instruction-aligned (Claude-Sonnet-4, DeepSeek-V3, Gemini-Flash-2.0, GPT-4.1-mini) and two open-weight (LLaMA-3-8B, Mistral-7B) -- using both standard classification metrics and new measures of cross-lingual reliability: Calibration Deviation (CD), Decision Bias (B), Language Robustness Score (LRS), and Language Stability Score (LSS). Results show that alignment, not scale, determines stability: aligned models maintain near-invariant accuracy and balanced calibration across typologically distant and low-resource languages (e.g., Lingala, Burmese), while open-weight models exhibit significant prompt-language sensitivity and calibration drift. These findings demonstrate that multilingual alignment enables language-agnostic reasoning and provide practical guidance for humanitarian organizations balancing budget constraints with reliability in multilingual deployment.
