CL | Feb 5

Polyglots or Multitudes? Multilingual LLM Answers to Value-laden Multiple-Choice Questions

arXiv:2602.05932v1 | 2 citations | h-index: 5
Originality: Incremental advance
AI Analysis

The paper examines language-induced variation in LLM answers to value-laden questions, a problem relevant to researchers and developers of multilingual AI systems. The advance is incremental: it extends prior work on multilingual LLMs, which studied factual recall, to value-laden questions.

The paper investigates whether multilingual large language models (LLMs) give consistent answers to value-laden multiple-choice questions across languages. Larger, instruction-tuned models show higher overall consistency, but the robustness of their responses varies greatly by question: some questions elicit total agreement, while others leave the answers split. Language-specific behavior appears in the consistent models, but only on certain questions.

Multiple-Choice Questions (MCQs) are often used to assess knowledge, reasoning abilities, and even values encoded in large language models (LLMs). While the effect of multilingualism has been studied on LLM factual recall, this paper seeks to investigate the less explored question of language-induced variation in value-laden MCQ responses. Are multilingual LLMs consistent in their responses across languages, i.e. behave like theoretical polyglots, or do they answer value-laden MCQs depending on the language of the question, like a multitude of monolingual models expressing different values through a single model? We release a new corpus, the Multilingual European Value Survey (MEVS), which, unlike prior work relying on machine translation or ad hoc prompts, solely comprises human-translated survey questions aligned in 8 European languages. We administer a subset of those questions to over thirty multilingual LLMs of various sizes, manufacturers and alignment-fine-tuning status under comprehensive, controlled prompt variations including answer order, symbol type, and tail character. Our results show that while larger, instruction-tuned models display higher overall consistency, the robustness of their responses varies greatly across questions, with certain MCQs eliciting total agreement within and across models while others leave LLM answers split. Language-specific behavior seems to arise in all consistent, instruction-fine-tuned models, but only on certain questions, warranting a further study of the selective effect of preference fine-tuning.
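The controlled prompt variations named in the abstract (answer order, symbol type, tail character) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the paper's setup: the symbol sets, the tail characters, the prompt template, and the helper names `prompt_variants` and `consistency` are hypothetical, and the agreement score is a simple majority share rather than the metric the authors report.

```python
from collections import Counter
from itertools import permutations, product

# Hypothetical variation axes; the paper's actual symbol sets and
# tail characters are not given in the abstract.
SYMBOL_SETS = {"letters": "ABCD", "digits": "1234"}
TAIL_CHARS = ["", " ", "\n"]


def prompt_variants(question, options):
    """Yield (prompt, symbol_to_option) for each order/symbol/tail combination."""
    for order, symbols, tail in product(
        permutations(options), SYMBOL_SETS.values(), TAIL_CHARS
    ):
        # zip stops at the shorter input, so unused symbols are dropped.
        mapping = dict(zip(symbols, order))
        lines = [question] + [f"{s}. {o}" for s, o in mapping.items()]
        # Assumed template: question, labelled options, then an answer cue.
        yield "\n".join(lines) + "\nAnswer:" + tail, mapping


def consistency(answers):
    """Share of answers equal to the most frequent one (1.0 = total agreement)."""
    if not answers:
        raise ValueError("no answers to score")
    return Counter(answers).most_common(1)[0][1] / len(answers)
```

In use, each prompt would be sent to a model, the returned symbol mapped back to its option text through `mapping`, and the resulting option texts scored with `consistency`, either within one language or pooled across languages after aligning the translated options.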

