LG, AI | Oct 20, 2025

I-RAVEN-X: Benchmarking Generalization and Robustness of Analogical and Mathematical Reasoning in Large Language and Reasoning Models

arXiv:2510.17496v2 | 4 citations | h-index: 24
Originality: Incremental advance
AI Analysis

This work addresses the need for better benchmarks to assess reasoning capabilities in AI models, though it is incremental as it builds on an existing benchmark.

The paper evaluates generalization and robustness in analogical and mathematical reasoning for large language and reasoning models. It introduces I-RAVEN-X, a benchmark that extends I-RAVEN with increased complexity and perceptual uncertainty, and finds that LRMs show improved productivity and systematicity but struggle with reasoning under uncertainty.

We introduce I-RAVEN-X, a symbolic benchmark designed to evaluate generalization and robustness in analogical and mathematical reasoning for Large Language Models (LLMs) and Large Reasoning Models (LRMs). I-RAVEN-X extends I-RAVEN by increasing operand complexity and attribute range and by introducing perceptual uncertainty. Empirical results show that, compared to LLMs, LRMs achieve improved productivity and systematicity on longer reasoning relations and wider attribute ranges, respectively. However, LRMs are still significantly challenged by reasoning under uncertainty and cannot effectively explore multiple probabilistic outcomes.

