CL · May 22, 2025

R1-Compress: Long Chain-of-Thought Compression via Chunk Compression and Search

Yibo Wang, Haotian Luo, Huanjin Yao, Tiansheng Huang, Haiying He, Rui Liu, Naiqiang Tan, Jiaxing Huang, Xiaochun Cao, Dacheng Tao, Li Shen

arXiv:2505.16838v224.525 citations · h-index: 34 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses the high computational cost that long reasoning chains impose on LLM users, though it appears incremental because it builds on existing compression approaches.

The paper tackles the computational overhead of Long Chain-of-Thought (Long-CoT) reasoning in large language models. It proposes R1-Compress, a chunk-level compression framework that cuts token usage by about 20% while largely preserving reasoning accuracy: 92.4% on MATH500, only 0.6% below the Long-CoT baseline.

Chain-of-Thought (CoT) reasoning enhances large language models (LLMs) by enabling step-by-step problem-solving, yet its extension to Long-CoT introduces substantial computational overhead due to increased token length. Existing compression approaches -- instance-level and token-level -- either sacrifice essential local reasoning signals like reflection or yield incoherent outputs. To address these limitations, we propose R1-Compress, a two-stage chunk-level compression framework that preserves both local information and coherence. Our method segments Long-CoT into manageable chunks, applies LLM-driven inner-chunk compression, and employs an inter-chunk search mechanism to select the short and coherent sequence. Experiments on Qwen2.5-Instruct models across MATH500, AIME24, and GPQA-Diamond demonstrate that R1-Compress significantly reduces token usage while maintaining comparable reasoning accuracy. On MATH500, R1-Compress achieves an accuracy of 92.4%, with only a 0.6% drop compared to the Long-CoT baseline, while reducing token usage by about 20%. Source code will be available at https://github.com/w-yibo/R1-Compress
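The two stages described in the abstract (segment the Long-CoT into chunks, generate compressed candidates per chunk, then search across chunks for a short and coherent sequence) can be illustrated with a minimal sketch. This is not the authors' implementation: the paragraph-based segmentation, the `propose` and `coherence` callables, the `min_coherence` threshold, and the greedy shortest-candidate rule are all assumptions made for illustration, and the toy functions merely stand in for LLM calls.

```python
from typing import Callable, List, Sequence


def segment(cot: str, max_words: int = 40) -> List[str]:
    """Greedily pack blank-line-separated paragraphs into chunks of
    at most max_words words (assumed segmentation rule)."""
    chunks: List[str] = []
    current: List[str] = []
    count = 0
    for para in (p.strip() for p in cot.split("\n\n")):
        if not para:
            continue
        n = len(para.split())
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def compress(
    cot: str,
    propose: Callable[[str], Sequence[str]],
    coherence: Callable[[str, str], float],
    max_words: int = 40,
    min_coherence: float = 0.5,
) -> str:
    """Chunk-level compression with a greedy inter-chunk search.

    propose(chunk) returns compressed candidates for one chunk
    (hypothetical stand-in for LLM-driven inner-chunk compression).
    coherence(prefix, candidate) scores how well a candidate follows
    the already selected text (hypothetical stand-in for a model score).
    """
    selected: List[str] = []
    for chunk in segment(cot, max_words):
        prefix = "\n\n".join(selected)
        # Keep the original chunk as a fallback candidate.
        candidates = list(propose(chunk)) + [chunk]
        coherent = [c for c in candidates if coherence(prefix, c) >= min_coherence]
        pool = coherent or [chunk]
        # Among coherent candidates, pick the shortest by word count.
        selected.append(min(pool, key=lambda c: len(c.split())))
    return "\n\n".join(selected)


# Toy stand-ins so the sketch runs without a model.
def toy_propose(chunk: str) -> List[str]:
    fillers = {"hmm,", "wait,", "okay,", "so"}
    words = chunk.split()
    trimmed = " ".join(w for w in words if w.lower() not in fillers)
    return [trimmed, words[0]]  # second candidate is deliberately too aggressive


def toy_coherence(prefix: str, candidate: str) -> float:
    # Treat a candidate as coherent only if it ends as a full sentence.
    return 1.0 if candidate.endswith(".") else 0.0


cot = "Okay, so we need x with 2x = 10.\n\nHmm, wait, dividing both sides by 2 gives x = 5."
short = compress(cot, toy_propose, toy_coherence, max_words=10)
print(short)
```

In this toy run the one-word candidates are rejected by the coherence check, so the search falls back to the trimmed versions of each chunk. A real system would replace both toy functions with model calls and would likely use a richer search objective than shortest-above-threshold.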
