CL · Dec 31, 2025

Compute-Accuracy Pareto Frontiers for Open-Source Reasoning Large Language Models

Ákos Prucs, Márton Csutora, Mátyás Antal, Márk Marosi

arXiv:2512.24776v1 · h-index: 1 · Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses the need for compute-aware model selection in industrial applications, though it is incremental as it applies existing evaluation methods to new models and benchmarks.

The study evaluates large language models (LLMs) on reasoning tasks by weighing accuracy against computational cost. It finds that Mixture of Experts (MoE) architectures are a strong candidate for balancing performance and efficiency, and it identifies a saturation point beyond which additional compute yields diminishing accuracy gains.

Large Language Models (LLMs) are demonstrating rapid improvements on complex reasoning benchmarks, particularly when allowed to utilize intermediate reasoning steps before converging on a final solution. However, current literature often overlooks the significant computational burden associated with generating long reasoning sequences. For industrial applications, model selection depends not only on raw accuracy but also on resource constraints and inference costs. In this work, we conduct a test-time-compute aware evaluation of both contemporary and older open-source LLMs, mapping their Pareto frontiers across math- and reasoning-intensive benchmarks. Our findings identify the Mixture of Experts (MoE) architecture as a strong candidate to balance performance and efficiency in our evaluation setting. Furthermore, we trace the trajectory of Pareto efficiency over time to derive an emergent trend regarding accuracy gain per unit of compute. Finally, we demonstrate that there is a saturation point for inference-time compute. Beyond a certain threshold, accuracy gains diminish, indicating that while extended reasoning capabilities are beneficial, they cannot overcome intrinsic model limitations regarding specific complexities.
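As an illustration of what mapping a compute-accuracy Pareto frontier involves, here is a minimal Python sketch. It is not the authors' code: the `ModelPoint` type, the `pareto_frontier` and `marginal_gains` helpers, the model names, and all numbers are hypothetical, and the compute unit is left abstract. The sketch assumes lower compute and higher accuracy are both preferred, and treats a model as dominated when another model is at least as accurate at no greater compute.

```python
from typing import List, NamedTuple, Tuple


class ModelPoint(NamedTuple):
    name: str
    compute: float   # hypothetical cost unit per query; lower is better
    accuracy: float  # benchmark accuracy in [0, 1]; higher is better


def pareto_frontier(points: List[ModelPoint]) -> List[ModelPoint]:
    """Return the non-dominated points, ordered by ascending compute."""
    # Sort by compute; on ties, the most accurate point comes first.
    ordered = sorted(points, key=lambda p: (p.compute, -p.accuracy))
    frontier: List[ModelPoint] = []
    best_accuracy = float("-inf")
    for p in ordered:
        # Keep a point only if it beats every cheaper (or equally cheap) point.
        # Exact duplicates are collapsed to a single entry.
        if p.accuracy > best_accuracy:
            frontier.append(p)
            best_accuracy = p.accuracy
    return frontier


def marginal_gains(frontier: List[ModelPoint]) -> List[Tuple[str, float]]:
    """Accuracy gained per extra unit of compute between adjacent frontier points."""
    return [
        (b.name, (b.accuracy - a.accuracy) / (b.compute - a.compute))
        for a, b in zip(frontier, frontier[1:])
    ]


# Made-up example data, for illustration only.
points = [
    ModelPoint("model-a", 1.0, 0.40),
    ModelPoint("model-b", 2.0, 0.55),
    ModelPoint("model-c", 3.0, 0.50),  # dominated by model-b
    ModelPoint("model-d", 8.0, 0.70),
    ModelPoint("model-e", 8.0, 0.65),  # dominated by model-d
]

frontier = pareto_frontier(points)
frontier_names = [p.name for p in frontier]
gains = marginal_gains(frontier)
```

In this toy data the marginal gain shrinks as compute grows along the frontier, which is the kind of pattern a saturation point would produce. The paper's own definition of compute and its saturation analysis may differ from this sketch.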
