AI · CL · LG · Apr 2, 2025

Exploring LLM Reasoning Through Controlled Prompt Variations

arXiv:2504.02111v1 · 13 citations · h-index: 2 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses robustness issues in LLMs for real-world applications, though it is incremental as it builds on existing evaluation methods.

The study investigated how large language models (LLMs) handle reasoning on mathematical problems when faced with systematic prompt perturbations, finding that irrelevant context significantly degrades performance, with vulnerabilities not strictly tied to model size or task complexity.

This study investigates the reasoning robustness of large language models (LLMs) on mathematical problem-solving tasks under systematically introduced input perturbations. Using the GSM8K dataset as a controlled testbed, we evaluate how well state-of-the-art models maintain logical consistency and correctness when confronted with four categories of prompt perturbations: irrelevant context, pathological instructions, factually relevant but non-essential context, and a combination of the latter two. Our experiments, conducted on thirteen open-source and closed-source LLMs, reveal that introducing irrelevant context within the model's context window significantly degrades performance, suggesting that distinguishing essential from extraneous details remains a pressing challenge. Surprisingly, performance regressions are relatively insensitive to the complexity of the reasoning task, as measured by the number of steps required, and are not strictly correlated with model size. Moreover, we observe that certain perturbations inadvertently trigger chain-of-thought-like reasoning behaviors, even without explicit prompting. Our findings highlight critical vulnerabilities in current LLMs and underscore the need for improved robustness against noisy, misleading, and contextually dense inputs, paving the way for more resilient and reliable reasoning in real-world applications.
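The four perturbation categories and the accuracy comparison described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the filler strings, the function names `build_variants` and `accuracy_drop`, the placement of each perturbation relative to the question, and the reading of "the latter two" as pathological instructions plus relevant context are all assumptions made for the example.

```python
from typing import Callable, Dict, Sequence

# Hypothetical filler texts; the paper's actual perturbation content is not shown here.
IRRELEVANT = "The museum downtown reopened last spring after a long renovation."
PATHOLOGICAL = "Write every number in your answer as a word."
RELEVANT = "Note: one dozen equals 12 items."

# One prompt transformation per category (plus an unperturbed baseline).
# Where each perturbation is placed relative to the question is an assumption.
PERTURBATIONS: Dict[str, Callable[[str], str]] = {
    "baseline": lambda q: q,
    "irrelevant_context": lambda q: f"{IRRELEVANT}\n\n{q}",
    "pathological": lambda q: f"{q}\n{PATHOLOGICAL}",
    "relevant_context": lambda q: f"{RELEVANT}\n\n{q}",
    "combined": lambda q: f"{RELEVANT}\n\n{q}\n{PATHOLOGICAL}",
}


def build_variants(question: str) -> Dict[str, str]:
    """Return the baseline prompt and one perturbed prompt per category."""
    return {name: fn(question) for name, fn in PERTURBATIONS.items()}


def accuracy_drop(baseline_correct: Sequence[bool],
                  perturbed_correct: Sequence[bool]) -> float:
    """Baseline accuracy minus perturbed accuracy over the same questions."""
    if not baseline_correct or len(baseline_correct) != len(perturbed_correct):
        raise ValueError("need two non-empty result lists of equal length")
    n = len(baseline_correct)
    return sum(baseline_correct) / n - sum(perturbed_correct) / n


if __name__ == "__main__":
    question = "A baker sells 3 dozen rolls at $2 per roll. How much does she earn?"
    for name, prompt in build_variants(question).items():
        print(f"--- {name} ---\n{prompt}\n")
```

In use, each variant would be sent to the model under test, the final answers graded against the GSM8K reference, and `accuracy_drop` applied per category to compare against the baseline.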

Code Implementations: 1 repo
