CL, AI, LG | Feb 18, 2025

Large Language Models Badly Generalize across Option Length, Problem Types, and Irrelevant Noun Replacements

Guangxiang Zhao, Saier Hu, Xiaoqi Jian, Jinzhu Wu, Yuhan Wu, Change Jia, Lin Sun, Xiangzheng Zhang

arXiv:2502.12459v3 | 10.94 citations | h-index: 6 | EMNLP

Originality: Incremental advance

AI Analysis

The paper shows that LLMs rely on superficial cues rather than robust representations, a critical problem for AI reliability and for generalization in real-world applications. The contribution is incremental, however, because it builds on existing evaluation methods.

The paper assesses the generalization ability of Large Language Models under controlled perturbations. Despite high benchmark scores, LLMs exhibit severe accuracy drops and biases: Qwen 2.5 1.5B's MMLU score drops from 89 to 36 when option lengths change, and GPT-4o loses 25 points when problem types change.

In this paper, we propose a "Generalization Stress Test" to assess the generalization ability of Large Language Models (LLMs) under slight, controlled perturbations of option length, problem type, and irrelevant nouns. We find that, despite high benchmark scores, LLMs exhibit severe accuracy drops and unexpected biases (e.g., a preference for longer distractors) when faced with these minor, content-preserving modifications. For example, Qwen 2.5 1.5B's MMLU score rises from 60 to 89 and drops from 89 to 36 when option lengths are changed without altering the question. Even GPT-4o experiences a 25-point accuracy loss when problem types are changed, with a 6-point drop across all three modification categories. These analyses suggest that LLMs rely heavily on superficial cues rather than forming robust, abstract representations that generalize across formats, lexical variations, and irrelevant content shifts.
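The three perturbation families named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `MCQItem` type, the `PADDING` filler text, the true/false recasting, and every function name are hypothetical choices made for illustration, and the abstract does not specify how the paper constructs its perturbed items.

```python
from dataclasses import dataclass, replace
import re


@dataclass(frozen=True)
class MCQItem:
    """A multiple-choice question; `answer` is an index into `options`."""
    question: str
    options: tuple[str, ...]
    answer: int


# Hypothetical content-free filler; the paper's lengthening procedure is not given here.
PADDING = " (this statement is offered as one of the candidate answers)"


def lengthen_distractors(item: MCQItem, padding: str = PADDING) -> MCQItem:
    """Pad every wrong option, leaving the question and the correct option unchanged."""
    options = tuple(
        opt if i == item.answer else opt + padding
        for i, opt in enumerate(item.options)
    )
    return replace(item, options=options)


def to_true_false(item: MCQItem, option_index: int) -> tuple[str, bool]:
    """Recast one option as a true/false prompt (an assumed kind of problem-type change)."""
    prompt = (
        f"{item.question}\n"
        f"Proposed answer: {item.options[option_index]}\n"
        "Is the proposed answer correct? Answer True or False."
    )
    return prompt, option_index == item.answer


def replace_nouns(item: MCQItem, mapping: dict[str, str]) -> MCQItem:
    """Swap whole-word occurrences of irrelevant nouns in the question text."""
    question = item.question
    for old, new in mapping.items():
        question = re.sub(rf"\b{re.escape(old)}\b", new, question)
    return replace(item, question=question)


def accuracy_drop(before: float, after: float) -> float:
    """Accuracy lost, in points, between the original and the perturbed benchmark."""
    return before - after


# Illustrative item, not taken from MMLU.
item = MCQItem(
    question="Alice has 3 apples and buys 2 more. How many apples does she have?",
    options=("4", "5", "6", "7"),
    answer=1,
)
longer = lengthen_distractors(item)
tf_prompt, tf_label = to_true_false(item, 1)
renamed = replace_nouns(item, {"apples": "pencils"})
```

Each function changes only surface form: the correct answer and the reasoning needed to reach it stay the same, which is what makes an accuracy gap attributable to the perturbation. Under this sketch, `accuracy_drop(89, 36)` gives the 53-point gap implied by the Qwen 2.5 1.5B figures quoted above.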
