AI, CL | Sep 26, 2025

The Thinking Spectrum: An Empirical Study of Tunable Reasoning in LLMs through Model Merging

arXiv:2509.22034v2 | 2 citations | h-index: 10
Originality: Incremental advance
AI Analysis

This work addresses the need for efficient methods to produce LLMs with specific reasoning profiles for real-world applications. It appears incremental, however, because it empirically studies existing merging techniques rather than introducing new ones.

This paper studies how to create large language models with tunable reasoning capabilities by evaluating model merging techniques across multiple reasoning benchmarks. It finds that merging is an effective way to calibrate the trade-off between reasoning accuracy and token efficiency, and that in some cases a merged model achieves both higher accuracy and lower token consumption than one of its parents.

The growing demand for large language models (LLMs) with tunable reasoning capabilities in many real-world applications highlights a critical need for methods that can efficiently produce a spectrum of models balancing reasoning depth and computational cost. Model merging has emerged as a promising, training-free technique to address this challenge by arithmetically combining the weights of a general-purpose model with a specialized reasoning model. While various merging techniques exist, their potential to create a spectrum of models with fine-grained control over reasoning abilities remains largely unexplored. This work presents a large-scale empirical study evaluating a range of model merging techniques across multiple reasoning benchmarks. We systematically vary merging strengths to construct accuracy-efficiency curves, providing the first comprehensive view of the tunable performance landscape. Our findings reveal that model merging offers an effective and controllable method for calibrating the trade-off between reasoning accuracy and token efficiency, even when parent models have highly divergent weight spaces. Crucially, we identify instances of Pareto Improvement, where a merged model achieves both higher accuracy and lower token consumption than one of its parents. Our study provides the first comprehensive analysis of this tunable space, offering practical guidelines for creating LLMs with specific reasoning profiles to meet diverse application demands.
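
To make the idea of "arithmetically combining the weights" at a chosen merging strength concrete, here is a minimal sketch in Python. It assumes plain linear interpolation between the two parents, which is only one simple merging scheme and is not necessarily among those the paper evaluates. The names `merge_linear`, `is_pareto_improvement`, and `strength`, the flat-list weight layout, and all numbers are hypothetical and chosen for illustration only; they are not results from the paper.

```python
from typing import Dict, List, Tuple

# Toy stand-in for a model checkpoint: parameter name -> flat list of values.
# (Assumption: real checkpoints hold tensors; flat lists keep the sketch self-contained.)
Weights = Dict[str, List[float]]


def merge_linear(base: Weights, reasoning: Weights, strength: float) -> Weights:
    """Interpolate: merged = (1 - strength) * base + strength * reasoning.

    strength = 0.0 returns the general-purpose parent,
    strength = 1.0 returns the reasoning parent.
    """
    if base.keys() != reasoning.keys():
        raise ValueError("parent models must share the same parameter names")
    merged: Weights = {}
    for name in base:
        b, r = base[name], reasoning[name]
        if len(b) != len(r):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [(1.0 - strength) * x + strength * y for x, y in zip(b, r)]
    return merged


def is_pareto_improvement(
    merged: Tuple[float, float], parent: Tuple[float, float]
) -> bool:
    """Each argument is (accuracy, average tokens per answer).

    True only if the merged model is both more accurate and uses fewer tokens.
    """
    acc_m, tok_m = merged
    acc_p, tok_p = parent
    return acc_m > acc_p and tok_m < tok_p


# Sweep the merging strength to obtain a spectrum of merged models.
base = {"w": [0.0, 2.0]}
reasoning = {"w": [1.0, 4.0]}
spectrum = {s: merge_linear(base, reasoning, s) for s in (0.0, 0.5, 1.0)}
```

Evaluating each model in `spectrum` on a benchmark would give one (accuracy, tokens) point per strength; plotting those points yields the kind of accuracy-efficiency curve the abstract describes, and `is_pareto_improvement` checks a merged point against a single parent.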
