CL AI Jul 21, 2025

Is Large Language Model Performance on Reasoning Tasks Impacted by Different Ways Questions Are Asked?

Seok Hwan Song, Mohna Chakraborty, Qi Li, Wallapak Tavanapong

arXiv:2507.15707v1 6.72 citations h-index: 24 ACL

Originality Synthesis-oriented

AI Analysis

This work addresses the problem of inconsistent LLM evaluation for researchers and practitioners, but it is incremental as it focuses on known variability in question formats.

The study investigated how different question types affect large language model (LLM) performance on reasoning tasks, finding significant variations in accuracy across question types and that reasoning accuracy does not always correlate with final answer selection.

Large Language Models (LLMs) have been evaluated using diverse question types, e.g., multiple-choice, true/false, and short/long answers. This study addresses a previously unexplored question: how do different question types affect LLM accuracy on reasoning tasks? We investigate the performance of five LLMs on three different types of questions using quantitative and deductive reasoning tasks. The performance metrics include accuracy in the reasoning steps and accuracy in choosing the final answer. Key Findings: (1) Significant differences exist in LLM performance across different question types. (2) Reasoning accuracy does not necessarily correlate with final selection accuracy. (3) The number of options and the choice of words influence LLM performance.
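The two metrics named in the abstract (accuracy in the reasoning steps and accuracy in choosing the final answer) can be illustrated with a minimal sketch. This is not the authors' evaluation code: the record layout, the question-type labels, the helper names `accuracy_by_type` and `agreement_rate`, and the toy records are all assumptions made for illustration, and the toy values are not results from the paper.

```python
from collections import defaultdict

# Hypothetical toy records, NOT data from the paper.
# Each record: (question_type, reasoning_correct, final_answer_correct)
records = [
    ("multiple_choice", True, True),
    ("multiple_choice", True, False),
    ("true_false", False, True),
    ("true_false", True, True),
    ("short_answer", False, False),
    ("short_answer", True, True),
]


def accuracy_by_type(records):
    """Return reasoning and final-answer accuracy for each question type."""
    totals = defaultdict(lambda: [0, 0, 0])  # [count, reasoning hits, final hits]
    for qtype, reasoning_ok, final_ok in records:
        entry = totals[qtype]
        entry[0] += 1
        entry[1] += int(reasoning_ok)
        entry[2] += int(final_ok)
    return {
        qtype: {"reasoning_acc": r / n, "final_acc": f / n}
        for qtype, (n, r, f) in totals.items()
    }


def agreement_rate(records):
    """Fraction of items where reasoning correctness matches final-answer correctness."""
    matches = sum(1 for _, reasoning_ok, final_ok in records if reasoning_ok == final_ok)
    return matches / len(records)


if __name__ == "__main__":
    for qtype, scores in accuracy_by_type(records).items():
        print(qtype, scores)
    print("agreement:", agreement_rate(records))
```

Reporting the two accuracies separately per question type, rather than a single pooled score, is what makes a gap between correct reasoning and correct final selection visible; the agreement rate here is only one simple way to summarize that gap.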

