CL · AI · Apr 14, 2025

S1-Bench: A Simple Benchmark for Evaluating System 1 Thinking Capability of Large Reasoning Models

Wenyuan Zhang, Shuaiyi Nie, Xinghua Zhang, Zefeng Zhang, Tingwen Liu

arXiv:2504.10368v320.916 citations · h-index: 7 · Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses how researchers and developers can assess intuitive thinking capabilities in AI models. Its contribution is incremental, since it focuses on benchmarking rather than model improvement.

The authors tackled the lack of benchmarks for evaluating system 1 thinking in Large Reasoning Models by introducing S1-Bench, a simple and diverse benchmark across multiple domains and languages, and found that 28 LRMs showed inefficiency, inadequate accuracy, and limited robustness on these tasks.

We introduce S1-Bench, a novel benchmark designed to evaluate the performance of Large Reasoning Models (LRMs) on simple tasks that favor intuitive system 1 thinking rather than deliberative system 2 reasoning. While LRMs have achieved significant breakthroughs in complex reasoning tasks through explicit chains of thought, their heavy reliance on system 2 thinking may limit their system 1 thinking capabilities. However, no appropriate benchmark exists for evaluating LRMs' system 1 thinking capabilities. To fill this gap, S1-Bench introduces a suite of simple, diverse, and natural questions across multiple domains and languages, specifically designed to assess LRMs' performance on questions more suitable for system 1. We conduct extensive evaluations across 28 LRMs, revealing their inefficiency, inadequate accuracy, and limited robustness when handling simple questions. Additionally, we observe a gap between their difficulty perception and generation length. Overall, this work paves the way toward dual-system compatibility in the development of LRMs.

