LG, AI | Feb 5

Curriculum Learning for Efficient Chain-of-Thought Distillation via Structure-Aware Masking and GRPO

Bowen Yu, Maolin Wang, Sheng Zhang, Binhao Wang, Yi Wen, Jingtong Gao, Bowen Liu, Zimo Zhao, Wanyu Wang, Xiangyu Zhao

arXiv:2602.17686v31.4 | h-index: 17

Originality: Incremental advance

AI Analysis

This paper addresses the problem of inefficient CoT distillation into smaller models, offering a method that improves reasoning while reducing verbosity. It appears incremental, as it builds on existing curriculum learning and optimization techniques.

The paper tackles the challenge of distilling Chain-of-Thought reasoning from large to compact language models with a three-stage curriculum learning framework, which yields an 11.29% accuracy improvement and a 27.4% reduction in output length on GSM8K for a 3B-parameter model.

Distilling Chain-of-Thought (CoT) reasoning from large language models into compact student models presents a fundamental challenge: teacher rationales are often too verbose for smaller models to reproduce faithfully. Existing approaches often compress reasoning into a single step, losing the interpretability that makes CoT valuable. We present a three-stage curriculum learning framework that addresses this capacity mismatch through progressive skill acquisition. First, we establish structural understanding via masked shuffled reconstruction. Second, we apply Group Relative Policy Optimization (GRPO) to masked completion tasks, enabling the model to discover its own balance between accuracy and brevity. Third, we identify persistent failure cases and guide the student to internalize teacher knowledge through targeted rewriting, again optimized with GRPO. Experiments on GSM8K demonstrate that our approach enables Qwen2.5-3B-Base to achieve an 11.29 percent accuracy improvement while reducing output length by 27.4 percent, surpassing both instruction-tuned variants and prior distillation methods.
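The first two stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how rationales are masked or shuffled, nor the reward used with GRPO. The step-level masking, the `<mask>` token, the `make_masked_shuffled` and `reward` helpers, and the length-penalty weight are all assumptions made for illustration. Only the group-relative advantage (each reward normalized by the mean and standard deviation of its sampled group) follows the standard GRPO formulation.

```python
import random
import statistics

MASK = "<mask>"  # hypothetical placeholder token, not taken from the paper


def make_masked_shuffled(steps, mask_ratio=0.3, seed=0):
    """Build one stage-1-style example (assumed format).

    Some reasoning steps are replaced by MASK, then the step order is
    shuffled. The student would be trained to recover the original,
    ordered rationale.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(mask_ratio * len(steps)))
    masked_idx = set(rng.sample(range(len(steps)), n_mask))
    corrupted = [MASK if i in masked_idx else s for i, s in enumerate(steps)]
    rng.shuffle(corrupted)
    return corrupted, list(steps)  # (model input, reconstruction target)


def reward(correct, n_tokens, max_tokens=256, length_weight=0.2):
    """Illustrative reward: correctness minus a small length penalty.

    The weighting is an assumption; it only shows how accuracy and
    brevity could be traded off in a single scalar.
    """
    penalty = length_weight * min(n_tokens, max_tokens) / max_tokens
    return (1.0 if correct else 0.0) - penalty


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize rewards within one sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    steps = ["48 / 2 = 24", "48 + 24 = 72", "Answer: 72"]
    corrupted, target = make_masked_shuffled(steps)
    print(corrupted, "->", target)

    # Four sampled completions for one prompt: (is_correct, token_count).
    group = [(True, 60), (True, 180), (False, 40), (False, 200)]
    rewards = [reward(c, n) for c, n in group]
    print(group_relative_advantages(rewards))
```

Under this assumed reward, a short correct completion receives a higher advantage than a long correct one from the same group, which is one way a policy could be pushed toward both accuracy and brevity.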
