CL · Feb 25, 2025

LR^2Bench: Evaluating Long-chain Reflective Reasoning Capabilities of Large Language Models via Constraint Satisfaction Problems

Jianghao Chen, Zhenlin Wei, Zhenjiang Ren, Ziyong Li, Jiajun Zhang

arXiv:2502.17848v4 · 2 citations · h-index: 4 · ACL

Originality: Incremental advance

AI Analysis

The work addresses the lack of appropriate benchmarks for assessing reflective reasoning in LLMs, a capability that is crucial for advancing their problem-solving. The contribution is incremental because it focuses on evaluation rather than model improvement.

The paper evaluates reflective reasoning in large language models by introducing LR^2Bench, a benchmark of 850 constraint satisfaction problems across six task types. Even advanced models such as DeepSeek-R1 and OpenAI o1-preview reach average Exact Match scores of only 20.0% and 23.6%, respectively.

Recent progress in Large Reasoning Models (LRMs) has significantly enhanced the reasoning abilities of Large Language Models (LLMs), empowering them to tackle increasingly complex tasks through reflection capabilities, such as making assumptions, backtracking, and self-refinement. However, effectively evaluating such reflection capabilities remains challenging due to the lack of appropriate benchmarks. To bridge this gap, we introduce LR$^2$Bench, a novel benchmark designed to evaluate the Long-chain Reflective Reasoning capabilities of LLMs. LR$^2$Bench comprises 850 samples across six Constraint Satisfaction Problems (CSPs) where reflective reasoning is crucial for deriving solutions that meet all given constraints. Each type of task focuses on distinct constraint patterns, such as knowledge-based, logical, and spatial constraints, providing a comprehensive evaluation of diverse problem-solving scenarios. Our extensive evaluation on both conventional LLMs and LRMs reveals that even the most advanced LRMs, such as DeepSeek-R1 and OpenAI o1-preview, struggle with tasks in LR$^2$Bench, achieving an average Exact Match score of only 20.0% and 23.6%, respectively. These findings underscore the significant room for improvement in the reflective reasoning capabilities of current LLMs.
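To make the headline metric concrete, the following is a minimal sketch of how an average Exact Match score over several task types might be computed. It is not the authors' evaluation code: the function names, the toy task names, the whitespace and case normalization, and the choice to macro-average across tasks (rather than pool all samples) are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical helper: the abstract does not specify how answers are normalized.
def exact_match(predicted: Sequence[str], reference: Sequence[str]) -> bool:
    """Return True only if every element of the predicted solution equals the reference."""
    return len(predicted) == len(reference) and all(
        p.strip().upper() == r.strip().upper()
        for p, r in zip(predicted, reference)
    )

# Hypothetical helper: macro-averaging over tasks is an assumption.
def average_exact_match(
    results: Dict[str, List[Tuple[Sequence[str], Sequence[str]]]]
) -> float:
    """Mean of the per-task Exact Match rates, expressed as a percentage."""
    per_task = [
        sum(exact_match(pred, ref) for pred, ref in pairs) / len(pairs)
        for pairs in results.values()
    ]
    return 100.0 * sum(per_task) / len(per_task)

# Toy data with placeholder task names; each pair is (prediction, reference).
toy_results = {
    "task_a": [
        (["1", "2", "3"], ["1", "2", "3"]),  # fully correct
        (["1", "2", "4"], ["1", "2", "3"]),  # one wrong cell, so no credit
    ],
    "task_b": [
        (["cat"], ["CAT"]),  # matches after normalization
        (["dog"], ["cow"]),
        (["owl"], ["owl"]),
        (["ant"], ["bee"]),
    ],
}

score = average_exact_match(toy_results)
```

Because Exact Match gives no partial credit, a solution that violates a single constraint scores zero, which is one reason a metric of this kind is demanding for long-chain problems.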
