Reasoning Models Can be Accurately Pruned Via Chain-of-Thought Reconstruction
This work provides a drop-in fix for compressing reasoning models. The contribution matters for deployment at scale, but it is incremental, since it builds on existing pruning workflows such as SparseGPT.
The paper tackles the high deployment cost of reasoning language models such as DeepSeek-R1, focusing on the performance loss that standard pruning methods cause in these models. It introduces Reasoning-Aware Compression (RAC), which improves pruning by also reconstructing activations from the model's own chain-of-thought traces.
Reasoning language models such as DeepSeek-R1 produce long chain-of-thought traces at inference time, which makes them costly to deploy at scale. We show that compression techniques such as neural network pruning cause greater performance loss on these models than on typical language modeling tasks, and in some cases can make the model slower, because the pruned model produces more thinking tokens while performing worse. We show that this is partly because standard LLM pruning methods often focus on input reconstruction, whereas reasoning is a decode-dominated task. We introduce a simple, drop-in fix: during pruning we jointly reconstruct activations from the input and the model's on-policy chain-of-thought traces. This "Reasoning-Aware Compression" (RAC) integrates seamlessly into existing pruning workflows such as SparseGPT, and boosts their performance significantly. Code reproducing the results in the paper can be found at: https://github.com/RyanLucas3/RAC
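The joint-reconstruction idea can be illustrated with a minimal sketch for a single linear layer. This is not the code from the repository above. It assumes calibration activations are already available as matrices, it uses random arrays as stand-ins for prompt-token and chain-of-thought activations, and it swaps in a simple diagonal-saliency pruning rule in place of SparseGPT's weight-update procedure. All function names are hypothetical.

```python
import numpy as np

def calibration_hessian(activation_batches):
    """Average X X^T over all calibration tokens; each batch has shape (d_in, n_tokens)."""
    d_in = activation_batches[0].shape[0]
    H = np.zeros((d_in, d_in))
    n_tokens = 0
    for X in activation_batches:
        H += X @ X.T
        n_tokens += X.shape[1]
    return H / n_tokens

def prune_row_mask(W, H, sparsity):
    """Per output row, drop the lowest-saliency weights, with saliency w^2 * diag(H).

    Simplified stand-in (assumption): SparseGPT also updates the surviving weights.
    """
    saliency = W**2 * np.diag(H)[None, :]
    k = int(W.shape[1] * sparsity)               # weights pruned per row
    drop = np.argsort(saliency, axis=1)[:, :k]   # indices of the k smallest per row
    mask = np.ones_like(W, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=1)
    return mask

def reconstruction_error(W, W_hat, H):
    """Mean per-token squared output error, ||(W - W_hat) X||_F^2 / n, as a trace."""
    D = W - W_hat
    return float(np.trace(D @ H @ D.T))

rng = np.random.default_rng(0)
d_in, d_out = 16, 8
W = rng.normal(size=(d_out, d_in))

# Synthetic stand-ins (assumption): real activations would come from a forward pass.
X_prompt = rng.normal(size=(d_in, 64))        # activations on prompt tokens
X_cot = 2.0 * rng.normal(size=(d_in, 256))    # activations on the model's own generated trace

# Input-only calibration uses X_prompt alone; joint calibration adds the decoded tokens.
H_prompt = calibration_hessian([X_prompt])
H_joint = calibration_hessian([X_prompt, X_cot])

mask = prune_row_mask(W, H_joint, sparsity=0.5)
W_hat = W * mask
err = reconstruction_error(W, W_hat, H_joint)
```

The only change between the two calibration settings is which activation batches feed `calibration_hessian`, which is what makes this kind of fix drop-in: the pruning rule itself is untouched, and decode-time tokens simply contribute to the statistics it optimizes against.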