LongReasonArena: A Long Reasoning Benchmark for Large Language Models
LongReasonArena addresses the problem of assessing long reasoning capabilities, which matters to LLM developers and researchers, though the contribution is incremental in that it builds on existing long-context benchmarks.
The authors tackle the lack of benchmarks for evaluating long reasoning in LLMs by introducing LongReasonArena, a benchmark whose tasks scale to as many as 1 million tokens of reasoning and on which a model such as DeepSeek-R1 achieves only 7.5% accuracy.
Existing long-context benchmarks for Large Language Models (LLMs) focus on evaluating comprehension of long inputs, while overlooking the evaluation of long reasoning abilities. To address this gap, we introduce LongReasonArena, a benchmark specifically designed to assess the long reasoning capabilities of LLMs. Our tasks require models to solve problems by executing multi-step algorithms that reflect key aspects of long reasoning, such as retrieval and backtracking. By controlling the inputs, the required reasoning length can be scaled arbitrarily, reaching up to 1 million tokens of reasoning for the most challenging tasks. Extensive evaluation results demonstrate that LongReasonArena presents a significant challenge for both open-source and proprietary LLMs. For instance, DeepSeek-R1 achieves only 7.5% accuracy on our task. Further analysis also reveals that accuracy declines linearly with the logarithm of the expected number of reasoning steps. Our code and data are available at https://github.com/LongReasonArena/LongReasonArena.
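The log-linear trend stated in the abstract can be read as accuracy ≈ a + b · log10(steps) with a negative slope b. The following is a minimal sketch of how such a relationship could be fitted by ordinary least squares. The function name `fit_log_linear`, the choice of base-10 logarithm, and the data points are all assumptions made for illustration; the points are synthetic placeholders generated from a made-up line, not results from the benchmark.

```python
import math

def fit_log_linear(steps, accuracy):
    """Least-squares fit of accuracy ~ a + b * log10(steps).

    Hypothetical helper for illustration; not taken from the benchmark's code.
    """
    xs = [math.log10(s) for s in steps]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(accuracy) / n
    # Slope = covariance(x, y) / variance(x); intercept follows from the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracy))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

# Synthetic placeholder data (NOT benchmark results): accuracy drops by
# 0.2 for every tenfold increase in the expected number of reasoning steps.
steps = [10**2, 10**3, 10**4, 10**5]
accuracy = [0.9 - 0.2 * (math.log10(s) - 2) for s in steps]

a, b = fit_log_linear(steps, accuracy)
print(f"intercept a = {a:.3f}, slope b = {b:.3f} per decade of steps")
```

Under this reading, a negative slope b means each tenfold increase in expected reasoning steps costs a roughly constant number of accuracy points. Real measurements would replace the synthetic lists above.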