CL · Oct 3, 2025

Beyond the Final Layer: Intermediate Representations for Better Multilingual Calibration in Large Language Models

Cambridge
arXiv:2510.03136v1 · 6 citations · h-index: 10
Originality: Highly original
AI Analysis

This paper addresses unreliable confidence estimates in multilingual LLMs deployed globally and offers a practical way to improve their trustworthiness.

The study finds that large language models are systematically less well calibrated in non-English languages than in English. It proposes training-free methods, including Language-Aware Confidence Ensemble (LACE), that use intermediate layers to improve multilingual calibration, reporting up to a 30% reduction in calibration error across more than 100 languages.

Confidence calibration, the alignment of a model's predicted confidence with its actual accuracy, is crucial for the reliable deployment of Large Language Models (LLMs). However, this critical property remains largely under-explored in multilingual contexts. In this work, we conduct the first large-scale, systematic study of multilingual calibration across six model families and over 100 languages, revealing that non-English languages suffer from systematically worse calibration. To diagnose this, we investigate the model's internal representations and find that the final layer, biased by English-centric training, provides a poor signal for multilingual confidence. In contrast, our layer-wise analysis uncovers a key insight: late-intermediate layers consistently offer a more reliable and better-calibrated signal. Building on this, we introduce a suite of training-free methods, including Language-Aware Confidence Ensemble (LACE), which adaptively selects an optimal ensemble of layers for each specific language. Our study highlights the hidden costs of English-centric alignment and offers a new path toward building more globally equitable and trustworthy LLMs by looking beyond the final layer.
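
To make the notion of calibration in the abstract concrete, here is a minimal sketch of how a gap between confidence and accuracy might be measured, and how per-layer confidences could be compared for one language. It assumes binned expected calibration error (ECE) as the metric, which the abstract does not name, and it assumes a confidence score per example is already available for each layer. The function names `expected_calibration_error` and `pick_best_layer` are hypothetical. Picking the single layer with the lowest held-out ECE is a simplification for illustration, not LACE itself, which is described as selecting an ensemble of layers per language.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width binned ECE: weighted mean of |accuracy - confidence| per bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Use only the interior edges so bin ids run from 0 to n_bins - 1;
    # a confidence of exactly 1.0 lands in the last bin.
    bin_ids = np.clip(np.digitize(confidences, edges[1:-1]), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of examples in the bin
    return float(ece)


def pick_best_layer(layer_confidences, correct, n_bins=10):
    """Return (layer index, ECE) for the layer with the lowest ECE.

    layer_confidences: shape (n_layers, n_examples), one confidence per
    example and layer, measured on held-out data for a single language
    (an assumption of this sketch).
    """
    eces = [expected_calibration_error(c, correct, n_bins) for c in layer_confidences]
    best = int(np.argmin(eces))
    return best, eces[best]
```

In this sketch, a model that reports 0.9 confidence on answers that are right only half the time has an ECE of 0.4, while one that reports 0.5 on the same answers has an ECE of 0. Running `pick_best_layer` separately on each language's held-out set would give a per-language choice of layer.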
