CL · May 28, 2025

Benchmarking Abstract and Reasoning Abilities Through A Theoretical Perspective

arXiv:2505.23833v1 · 3 citations · h-index: 24 · Has Code · ICML
Originality: Incremental advance
AI Analysis

The paper addresses the need for rigorous evaluation of abstract reasoning in LLMs, though the advance is incremental because it builds on existing benchmarking approaches.

The paper assesses abstract reasoning in Large Language Models with a theoretically grounded benchmark and two novel metrics. Its evaluations indicate that current LLMs lack robust abstract reasoning, with critical limitations in non-decimal arithmetic and symbolic reasoning that persist despite chain-of-thought prompting.

In this paper, we aim to establish a simple, effective, and theoretically grounded benchmark for rigorously probing abstract reasoning in Large Language Models (LLMs). To achieve this, we first develop a mathematical framework that defines abstract reasoning as the ability to (i) extract essential patterns independent of surface representations and (ii) apply consistent rules to these abstract patterns. Based on this framework, we introduce two novel, complementary metrics: \(\Gamma\) measures basic reasoning accuracy, while \(\Delta\) quantifies a model's reliance on specific symbols rather than underlying patterns, a key indicator of true abstraction versus mere memorization. To implement this measurement, we design a benchmark built on systematic symbol remapping in rule-based tasks, which forces models to demonstrate genuine pattern recognition beyond superficial token matching. Extensive LLM evaluations using this benchmark (commercial API models, 7B-70B models, and multi-agent systems) reveal: (1) critical limitations in non-decimal arithmetic and symbolic reasoning; (2) persistent abstraction gaps despite chain-of-thought prompting; and (3) the effectiveness of \(\Delta\) in robustly measuring memory dependence by quantifying performance degradation under symbol remapping, particularly highlighting operand-specific memorization. These findings underscore that current LLMs, despite domain-specific strengths, still lack robust abstract reasoning, highlighting key areas for future improvement.
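The symbol-remapping idea can be illustrated with a small sketch. Everything below is an assumption made for illustration, not the paper's implementation: the abstract does not give the exact metric definitions, so `evaluate` returns plain accuracy on the original tasks as a stand-in for the accuracy metric and the accuracy drop under remapping as a stand-in for the symbol-dependence metric. The task set, the digit-to-letter table, and the two toy "models" (`memorizer`, `rule_follower`) are likewise hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Task = Tuple[str, str]                       # (prompt, expected answer)
Model = Callable[[str, Dict[str, str]], str]  # (prompt, symbol table) -> answer

def remap(text: str, mapping: Dict[str, str]) -> str:
    """Apply a one-to-one symbol substitution character by character."""
    return "".join(mapping.get(ch, ch) for ch in text)

def accuracy(model: Model, tasks: List[Task], symbols: Dict[str, str]) -> float:
    """Fraction of tasks answered exactly right under the given symbol table."""
    correct = sum(model(prompt, symbols) == answer for prompt, answer in tasks)
    return correct / len(tasks)

def evaluate(model: Model, tasks: List[Task],
             mapping: Dict[str, str]) -> Tuple[float, float]:
    """Return (base accuracy, accuracy drop under remapping).

    Illustrative stand-ins only; the paper's own metric definitions may differ.
    """
    identity = {s: s for s in mapping}
    remapped = [(remap(p, mapping), remap(a, mapping)) for p, a in tasks]
    base = accuracy(model, tasks, identity)
    return base, base - accuracy(model, remapped, mapping)

# Hypothetical toy data: single-digit additions, digits remapped to letters.
TASKS: List[Task] = [("2+3", "5"), ("4+4", "8"), ("1+6", "7")]
LETTERS = {str(d): "abcdefghij"[d] for d in range(10)}

def memorizer(prompt: str, symbols: Dict[str, str]) -> str:
    """Looks up surface strings it has seen; ignores the symbol table."""
    return dict(TASKS).get(prompt, "")

def rule_follower(prompt: str, symbols: Dict[str, str]) -> str:
    """Decodes symbols to digits, applies the addition rule, re-encodes."""
    inverse = {v: k for k, v in symbols.items()}
    left, right = remap(prompt, inverse).split("+")
    return remap(str(int(left) + int(right)), symbols)

print(evaluate(memorizer, TASKS, LETTERS))      # base accuracy and drop
print(evaluate(rule_follower, TASKS, LETTERS))
```

In this toy setting the memorizer scores perfectly on the original prompts and fails once digits are replaced by letters, while the rule follower is unaffected; that gap is the kind of symbol dependence the abstract describes, though real evaluations would query an LLM rather than a lookup table.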

Code Implementations: 1 repo