CL Mar 28, 2024

Large Language Models Struggle with Unreasonability in Math Problems

Jingyuan Ma, Damai Dai, Zihang Yuan, Rui Li, Weilin Luo, Bin Wang, Qun Liu, Lei Sha, Zhifang Sui

arXiv:2403.19346v67.711 citations h-index: 19

Originality: Incremental advance

AI Analysis

The paper addresses a vulnerability of LLMs in math applications, though the contribution is incremental because it focuses on a single failure mode.

The researchers address LLMs' difficulty with unreasonable math problems by creating the Unreasonable Math Problems (UMP) benchmark; across 19 LLMs, even state-of-the-art general models such as GPT-4o score only 0.6 on it.

Large Language Models (LLMs) have shown remarkable success on a wide range of math and reasoning benchmarks. However, we observe that they often struggle when faced with unreasonable math problems. Instead of recognizing these issues, models frequently proceed as if the problem is well-posed, producing incorrect answers or falling into overthinking and verbose self-correction. To systematically investigate this overlooked vulnerability, we propose the Unreasonable Math Problems (UMP) benchmark, designed to evaluate LLMs' ability to detect and respond to unreasonable math problem statements. Based on extensive experiments covering 19 LLMs, we find that even state-of-the-art general models like GPT-4o achieve only a score of 0.6 on UMP. While reasoning models such as DeepSeek-R1 demonstrate a higher sensitivity to unreasonable inputs, this often comes at the cost of generating overly long and meaningless responses that fail to converge. We further explore prompting and fine-tuning methods, which offer partial improvements but also introduce trade-offs, shedding light on both the potential and limitations of LLMs in this challenging setting.
