CL, AI | Sep 25, 2025

Dual-Head Reasoning Distillation: Improving Classifier Accuracy with Train-Time-Only Reasoning

arXiv:2509.21487v2 | 2 citations | h-index: 1
Originality: Incremental advance
AI Analysis

This paper addresses slow inference in reasoning-enhanced classifiers, a practical concern for NLP practitioners. The method is an incremental advance, but it offers real efficiency gains.

The paper tackles the trade-off between improved classification accuracy from Chain-of-Thought prompting and its high inference cost by introducing Dual-Head Reasoning Distillation, a training method that achieves relative accuracy gains of 0.65-5.47% over baselines on SuperGLUE tasks while matching the inference throughput of standard classifiers.

Chain-of-Thought (CoT) prompting often improves classification accuracy, but it introduces a significant throughput penalty with rationale generation (Wei et al., 2022; Cheng and Van Durme, 2024). To resolve this trade-off, we introduce Dual-Head Reasoning Distillation (DHRD), a simple training method for decoder-only language models (LMs) that adds (i) a pooled classification head used during training and inference and (ii) a reasoning head supervised by teacher rationales used only in training. We train with a loss function that is a weighted sum of label cross-entropy and token-level LM loss over input-plus-rationale sequences. On seven SuperGLUE tasks, DHRD yields relative gains of 0.65-5.47% over pooled baselines, with notably larger gains on entailment/causal tasks. Since we disable the reasoning head at test time, inference throughput matches pooled classifiers and exceeds CoT decoding on the same backbones by 96-142 times in QPS.
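
The training objective described above can be illustrated with a small dependency-free sketch. It is not the authors' implementation. The function names, the `lm_weight` parameter, and the choice to weight only the LM term are assumptions made for illustration; the abstract states only that the loss is a weighted sum of label cross-entropy and token-level LM loss.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over a list of floats.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def cross_entropy(logits, target):
    # Negative log-likelihood of the target index.
    return -log_softmax(logits)[target]

def lm_loss(token_logits, next_tokens):
    # Mean next-token cross-entropy over the input-plus-rationale sequence.
    losses = [cross_entropy(l, t) for l, t in zip(token_logits, next_tokens)]
    return sum(losses) / len(losses)

def dual_head_loss(class_logits, label, token_logits, next_tokens, lm_weight=0.5):
    # Assumed form of the weighted sum: label CE + lm_weight * LM loss.
    # The paper's exact weighting scheme may differ.
    ce = cross_entropy(class_logits, label)
    lm = lm_loss(token_logits, next_tokens)
    return ce + lm_weight * lm

def predict(class_logits):
    # Inference uses only the classification head; no rationale is decoded.
    return max(range(len(class_logits)), key=lambda i: class_logits[i])

# Training step on toy values: 2 classes, 1 rationale token, vocabulary of 4.
train_loss = dual_head_loss([0.0, 0.0], 0, [[0.0, 0.0, 0.0, 0.0]], [1])
# Test time: the reasoning head is disabled, so only class logits are needed.
predicted = predict([0.1, 2.0])
```

Because `predict` never touches the token-level logits, the inference cost in this sketch is that of a plain pooled classifier, which is the property behind the throughput claim in the abstract.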

