CL | AI | Jul 21, 2025

Is Large Language Model Performance on Reasoning Tasks Impacted by Different Ways Questions Are Asked?

arXiv:2507.15707v1 | 2 citations | h-index: 24 | ACL
Originality: Synthesis-oriented
AI Analysis

This work addresses the problem of inconsistent LLM evaluation for researchers and practitioners, but it is incremental as it focuses on known variability in question formats.

The study investigated how different question types affect large language model (LLM) performance on reasoning tasks, finding significant variations in accuracy across question types and that reasoning accuracy does not always correlate with final answer selection.

Large Language Models (LLMs) have been evaluated using diverse question types, e.g., multiple-choice, true/false, and short/long answers. This study answers an unexplored question about the impact of different question types on LLM accuracy on reasoning tasks. We investigate the performance of five LLMs on three different types of questions using quantitative and deductive reasoning tasks. The performance metrics include accuracy in the reasoning steps and accuracy in choosing the final answer. Key Findings: (1) Significant differences exist in LLM performance across different question types. (2) Reasoning accuracy does not necessarily correlate with the final selection accuracy. (3) The number of options and the choice of words influence LLM performance.

