LG | AI | Sep 26, 2025

Rethinking Large Language Model Distillation: A Constrained Markov Decision Process Perspective

arXiv:2509.22921v1 | 2 citations | h-index: 30
Originality: Highly original
AI Analysis

This work addresses reward-aware distillation for resource-constrained settings and offers a theoretically grounded, efficient solution, though it is incremental in that it builds on existing work integrating task-specific rewards into distillation.

The authors formulate large language model distillation as a constrained reinforcement learning problem and report better constraint satisfaction rates and reasoning performance than baselines, while maintaining competitive task performance on mathematical reasoning tasks.

We introduce a novel approach to large language model (LLM) distillation by formulating it as a constrained reinforcement learning problem. While recent work has begun exploring the integration of task-specific rewards into distillation processes, existing methods typically rely on ad-hoc reward weighting. We propose a principled optimization framework that maximizes task-specific rewards while constraining the divergence from the teacher model to remain below a specified threshold. Our approach adapts constrained state-augmented reinforcement learning to the distillation setting, introducing a modified reward function that maintains theoretical guarantees of constraint satisfaction without requiring state augmentation or teacher model access during deployment, and without the computational overhead of dual Lagrangian methods. Through extensive experiments on mathematical reasoning tasks, we demonstrate that our method achieves better constraint satisfaction rates and better reasoning than soft Lagrangian relaxation baselines while maintaining competitive task performance. Our framework provides a theoretically grounded and practically efficient solution for reward-aware distillation in resource-constrained settings.
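
To make the abstract's problem statement concrete, the constrained objective and the soft Lagrangian relaxation it is compared against can be written roughly as follows. This is a sketch under assumed notation and not the paper's own formulation: the symbols $\pi_\theta$ (student), $\pi_T$ (teacher), $r$ (task reward), $D$ (a divergence whose exact form and direction are not stated here), $\varepsilon$ (threshold), and $\lambda$ (penalty weight) are all introduced for illustration only.

```latex
% Constrained distillation (assumed notation): maximize task reward
% subject to a bound on the student's divergence from the teacher.
\max_{\theta} \;
  \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
  \bigl[ r(x, y) \bigr]
\quad \text{s.t.} \quad
  \mathbb{E}_{x \sim \mathcal{D}}
  \bigl[ D\bigl(\pi_\theta(\cdot \mid x) \,\|\, \pi_T(\cdot \mid x)\bigr) \bigr]
  \le \varepsilon

% Soft Lagrangian relaxation (the kind of baseline the abstract mentions):
% the constraint becomes a penalty with weight \lambda \ge 0.
\max_{\theta} \;
  \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
  \bigl[ r(x, y) \bigr]
  \;-\; \lambda \Bigl(
    \mathbb{E}_{x \sim \mathcal{D}}
    \bigl[ D\bigl(\pi_\theta(\cdot \mid x) \,\|\, \pi_T(\cdot \mid x)\bigr) \bigr]
    - \varepsilon
  \Bigr)
```

In the relaxed form, a fixed $\lambda$ corresponds to the ad-hoc reward weighting the abstract criticizes, while learning $\lambda$ by dual ascent adds the optimization overhead the authors say their modified reward function avoids. The modified reward itself is not specified in this summary, so it is not reproduced here.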
