CL · AI · LG · May 8, 2025

Scalable LLM Math Reasoning Acceleration with Low-rank Distillation

arXiv:2505.07861v2 · 5 citations · h-index: 14
Originality: Incremental advance
AI Analysis

The paper addresses the high resource demands of math reasoning in LLMs and offers a practical way to deploy efficient models without sacrificing performance. The advance is incremental, because it builds on existing efficient inference techniques.

The paper tackles the computational inefficiency of large language models (LLMs) on math reasoning by proposing Caprese, a low-rank distillation method that recovers math capabilities lost when efficient inference methods are applied. Reported gains include roughly 2B fewer active parameters and a reduction of more than 16% in time-to-next-token, achieved with minimal training data.

Due to long generations, large language model (LLM) math reasoning demands significant computational resources and time. While many existing efficient inference methods have been developed with excellent performance preservation on language tasks, they often severely degrade math performance. In this paper, we propose Caprese, a resource-efficient distillation method to recover capabilities lost from deploying efficient inference methods, focused primarily on feedforward blocks. With original weights unperturbed, roughly 1% of additional parameters, and only 20K synthetic training samples, we are able to recover much if not all of the math capabilities lost from efficient inference for thinking LLMs, without harming language tasks for instruct LLMs. Moreover, Caprese slashes the number of active parameters (~2B cut for Gemma 2 9B and Llama 3.1 8B) and integrates cleanly into existing model layers to reduce latency (>16% time-to-next-token reduction) while encouraging response brevity (up to 8.5% fewer tokens).
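The idea of adding a small low-rank branch to a frozen feedforward block can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the paper's implementation: the efficient method is modelled as computing only a subset of hidden units (`keep`), the correction is a single pair of matrices (`down`, `up`) applied to the block input, and the distillation target is the dense block's output under a mean squared error. All function and variable names are hypothetical.

```python
import random

def matvec(matrix, vec):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]

def dense_ffn(x, w_in, w_out):
    """Teacher: the full feedforward block, ReLU(w_in x) projected by w_out."""
    hidden = [max(0.0, h) for h in matvec(w_in, x)]
    return matvec(w_out, hidden)

def sparse_ffn(x, w_in, w_out, keep):
    """Assumed efficient variant: only hidden units listed in `keep` are computed."""
    hidden = [0.0] * len(w_in)
    for i in keep:
        hidden[i] = max(0.0, sum(a * b for a, b in zip(w_in[i], x)))
    return matvec(w_out, hidden)

def student_ffn(x, w_in, w_out, keep, down, up):
    """Efficient output plus a low-rank correction up @ (down @ x).
    w_in and w_out stay frozen; only `down` and `up` would be trained."""
    base = sparse_ffn(x, w_in, w_out, keep)
    correction = matvec(up, matvec(down, x))
    return [b + c for b, c in zip(base, correction)]

def distill_loss(x, w_in, w_out, keep, down, up):
    """Mean squared error between the dense teacher and the corrected student."""
    teacher = dense_ffn(x, w_in, w_out)
    student = student_ffn(x, w_in, w_out, keep, down, up)
    return sum((t - s) ** 2 for t, s in zip(teacher, student)) / len(teacher)

def low_rank_param_count(d_model, rank):
    """Parameters added per block by the (rank x d_model) and (d_model x rank) pair."""
    return 2 * d_model * rank

# Toy setup with hypothetical sizes: d_model=8, hidden=32, rank=2.
rng = random.Random(0)
d_model, d_hidden, rank = 8, 32, 2
w_in = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(d_hidden)]
w_out = [[rng.gauss(0, 1) for _ in range(d_hidden)] for _ in range(d_model)]
down = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(rank)]
up = [[0.0] * rank for _ in range(d_model)]  # zero init: student starts equal to the sparse block
keep = list(range(d_hidden // 2))            # half of the hidden units stay active
x = [rng.gauss(0, 1) for _ in range(d_model)]

loss_before_training = distill_loss(x, w_in, w_out, keep, down, up)
```

In this sketch a training loop would minimise `distill_loss` over `down` and `up` only, which mirrors the abstract's claim that the original weights stay unperturbed. With a hypothetical `d_model` of 4096 and rank of 256, `low_rank_param_count` gives 2,097,152 added parameters per block, which shows how such a branch can stay within a budget of about 1% of a multi-billion-parameter model.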
