CL · AI · LG · Feb 18, 2025

Large Language Models Badly Generalize across Option Length, Problem Types, and Irrelevant Noun Replacements

arXiv:2502.12459v3 · 4 citations · h-index: 6 · EMNLP
AI Analysis

The findings indicate that LLMs rely on superficial cues rather than robust representations, a critical problem for AI reliability and for generalization in real-world applications. The contribution is incremental, however, because it builds on existing evaluation methods.

The paper assesses the generalization ability of Large Language Models under controlled perturbations. It finds that, despite high benchmark scores, LLMs show severe accuracy drops and biases: Qwen 2.5 1.5B's MMLU score falls from 89 to 36 when option lengths are changed, and GPT4o loses 25 points when problem types are changed.

In this paper, we propose a "Generalization Stress Test" to assess the generalization ability of Large Language Models (LLMs) under slight, controlled perturbations: changes to option length, changes to problem type, and irrelevant noun replacements. We find that, despite high benchmark scores, LLMs exhibit severe accuracy drops and unexpected biases (e.g., a preference for longer distractors) when faced with these minor, content-preserving modifications. For example, Qwen 2.5 1.5B's MMLU score rises from 60 to 89 and drops from 89 to 36 when option lengths are changed without altering the question. Even GPT4o experiences a 25-point accuracy loss when problem types are changed, with a 6-point drop across all three modification categories. These analyses suggest that LLMs rely heavily on superficial cues rather than forming robust, abstract representations that generalize across formats, lexical variations, and irrelevant content shifts.
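
To make the option-length perturbation concrete, the following is a minimal sketch of how such a modification might be applied to a multiple-choice item. It is not the authors' implementation: the `MCQItem` class, the `lengthen_options` helper, and the filler phrase are all hypothetical, and the abstract does not say how the paper actually lengthens options. The sketch only illustrates the idea of changing option length while leaving the question and the correct answer untouched.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class MCQItem:
    """A multiple-choice item (hypothetical structure, for illustration only)."""
    question: str
    options: Tuple[str, ...]
    answer_index: int


# Hypothetical content-neutral filler; the paper's actual padding is not given here.
PADDING = " (this is one of the listed options)"


def lengthen_options(item: MCQItem, target: str = "distractors") -> MCQItem:
    """Return a copy of `item` with some options padded by neutral filler.

    target="distractors" pads every wrong option; target="answer" pads only
    the correct option. The question and the answer index are unchanged.
    """
    if target not in ("distractors", "answer"):
        raise ValueError("target must be 'distractors' or 'answer'")
    new_options = []
    for i, option in enumerate(item.options):
        is_answer = i == item.answer_index
        should_pad = is_answer if target == "answer" else not is_answer
        new_options.append(option + PADDING if should_pad else option)
    return replace(item, options=tuple(new_options))


def accuracy_drop(before: float, after: float) -> float:
    """Accuracy lost, in points, between the original and perturbed benchmark."""
    return before - after


item = MCQItem("What is 2 + 2?", ("3", "4", "5", "6"), answer_index=1)
perturbed = lengthen_options(item, target="distractors")

# The reported drop from 89 to 36 corresponds to a 53-point loss.
drop = accuracy_drop(89, 36)
```

Scoring a model on both `item` and `perturbed` and comparing accuracies would then show whether its answers depend on option length rather than on content.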

