LG AI CL Apr 10

Revisiting the Capacity Gap in Chain-of-Thought Distillation from a Practical Perspective

arXiv:2604.08880 78.21 citations h-index: 4
AI Analysis

This work addresses practical challenges in selecting teacher-student pairs for chain-of-thought distillation, offering guidance for researchers and practitioners in model compression.

The paper revisits the capacity gap in chain-of-thought distillation. It finds that distillation often leaves the student below its own pre-distillation baseline, and it proposes a more realistic evaluation protocol under which capacity-gap effects do not consistently dominate across tasks and settings.

Chain-of-thought (CoT) distillation transfers reasoning behaviors from a strong teacher to a smaller student, but prior work reports a capacity gap: distillation may fail when the teacher-student capability mismatch is large. We revisit the capacity gap from a practical perspective by re-examining commonly used experimental settings. Notably, we find that CoT distillation often degrades performance compared to the student's pre-distillation baseline, an issue obscured when only post-distillation comparisons are reported. We therefore propose a more realistic evaluation protocol and find that the impact of capacity gap effects does not consistently dominate across tasks and settings, especially when candidate teachers differ substantially in performance. Our results offer practical guidance for selecting teacher-student pairs in CoT distillation.
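The abstract's point about post-distillation-only comparisons can be illustrated with a minimal Python sketch. This is not the paper's evaluation protocol, which the abstract does not spell out. The class `DistillationResult`, the helper names, the teacher labels, and every accuracy value below are hypothetical placeholders invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class DistillationResult:
    teacher: str           # hypothetical teacher label
    student_before: float  # student accuracy before distillation
    student_after: float   # student accuracy after distillation

    @property
    def gain(self) -> float:
        """Change relative to the student's own pre-distillation baseline."""
        return self.student_after - self.student_before


def rank_by_post_accuracy(results):
    """Post-only view: order teachers by the distilled student's accuracy."""
    return sorted(results, key=lambda r: r.student_after, reverse=True)


def degraded(results):
    """Baseline-aware view: teachers whose distillation hurt the student."""
    return [r.teacher for r in results if r.gain < 0]


# Placeholder numbers, NOT results from the paper.
results = [
    DistillationResult("teacher_small", student_before=0.50, student_after=0.46),
    DistillationResult("teacher_large", student_before=0.50, student_after=0.42),
]

best = rank_by_post_accuracy(results)[0]
print("best teacher by post-distillation accuracy:", best.teacher)
print("teachers that degrade the student:", degraded(results))
```

With these placeholder values, the post-only ranking prefers the smaller teacher, which looks like a capacity-gap effect, while the baseline-aware check shows that both teachers leave the student worse off than before distillation.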

