SE · AI · CL · Dec 15, 2025

Revisiting the Reliability of Language Models in Instruction-Following

arXiv:2512.14754v1 · 3 citations · Has Code
Originality: Incremental advance
AI Analysis

This paper addresses unreliable model behavior for users who depend on consistent results across subtle phrasing changes. The advance is incremental, since it focuses on one specific aspect of reliability.

The paper tackles the problem that language models' high benchmark accuracy does not ensure reliability in real-world use, where nuanced prompt variations can cause performance drops of up to 61.8%, as shown through a new evaluation on IFEval++.

Advanced LLMs have achieved near-ceiling instruction-following accuracy on benchmarks such as IFEval. However, these impressive scores do not necessarily translate to reliable services in real-world use, where users often vary their phrasing, contextual framing, and task formulations. In this paper, we study nuance-oriented reliability: whether models exhibit consistent competence across cousin prompts that convey analogous user intents but with subtle nuances. To quantify this, we introduce a new metric, reliable@k, and develop an automated pipeline that generates high-quality cousin prompts via data augmentation. Building upon this, we construct IFEval++ for systematic evaluation. Across 20 proprietary and 26 open-source LLMs, we find that current models exhibit substantial insufficiency in nuance-oriented reliability -- their performance can drop by up to 61.8% with nuanced prompt modifications. What's more, we characterize it and explore three potential improvement recipes. Our findings highlight nuance-oriented reliability as a crucial yet underexplored next step toward more dependable and trustworthy LLM behavior. Our code and benchmark are accessible: https://github.com/jianshuod/IFEval-pp.
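
The abstract names the reliable@k metric but does not define it. As a rough illustration only, the sketch below assumes one plausible reading: a seed prompt counts as reliably handled only if the model passes all k of its cousin prompts. The function name `reliable_at_k`, the `demo` data, and the pass/fail layout are hypothetical and are not taken from the paper or its repository.

```python
from typing import Dict, List


def reliable_at_k(results: Dict[str, List[bool]], k: int) -> float:
    """Fraction of seed prompts whose first k cousin prompts all pass.

    ASSUMPTION: this all-k-must-pass definition is an illustrative reading
    of reliable@k, not the paper's confirmed formula.
    `results` maps a seed prompt id to pass/fail outcomes, one per cousin.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not results:
        raise ValueError("results must contain at least one seed prompt")

    reliable = 0
    for prompt_id, outcomes in results.items():
        if len(outcomes) < k:
            raise ValueError(f"{prompt_id} has fewer than {k} cousin outcomes")
        # A single failure among the k cousins makes the family unreliable.
        if all(outcomes[:k]):
            reliable += 1
    return reliable / len(results)


# Hypothetical outcomes: four seed prompts, three cousin prompts each.
demo = {
    "p1": [True, True, True],
    "p2": [True, False, True],
    "p3": [True, True, True],
    "p4": [False, True, True],
}

print(reliable_at_k(demo, 1))  # score on the first cousin only
print(reliable_at_k(demo, 3))  # score when all three cousins must pass
```

Under this reading the score can only stay flat or fall as k grows, which is how a model with high single-prompt accuracy can still score poorly on consistency.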

