LG · AI · May 12

Generative Diffusion Prior Distillation for Long-Context Knowledge Transfer

arXiv:2605.11414 · 10.4
Predicted impact: top 89% in LG · last 90 days · Originality: Highly original
AI Analysis

For practitioners deploying time-series classifiers under latency or cost constraints, this method enables accurate predictions from partial data without retraining full-sequence models.

This work addresses the challenge of classifying partial time series by distilling knowledge from full-sequence teachers to partial-sequence students. The proposed Generative Diffusion Prior Distillation (GDPD) method achieves significant improvements across multiple datasets and architectures, with up to 15% accuracy gains in early classification settings.

While traditional time-series classifiers assume full sequences at inference, practical constraints (latency and cost) often limit inputs to partial prefixes. The absence of class-discriminative patterns in partial data can significantly hinder a classifier's ability to generalize. This work uses knowledge distillation (KD) to equip partial time-series classifiers with the generalization ability of their full-sequence counterparts. In KD, a high-capacity teacher transfers supervision to aid student learning on the target task. Matching teacher features has shown promise in closing the generalization gap that arises from limited parameter capacity. However, when the generalization gap arises from training-data differences (full versus partial), the teacher's full-context features can be an overwhelming target signal for the student's short-context features. To provide progressive, diverse, and collective teacher supervision, we propose Generative Diffusion Prior Distillation (GDPD), a novel KD framework that treats short-context student features as degraded observations of the target full-context features. Inspired by the iterative restoration capability of diffusion models, we learn a diffusion-based generative prior over teacher features. Leveraging this prior, we posterior-sample target teacher representations that could best explain the missing long-range information in the student features, and we optimize the student features to be minimally degraded relative to these targets. GDPD provides each student feature with a distribution of task-relevant long-context knowledge, which benefits learning on the partial classification task. Extensive experiments across earliness settings, datasets, and architectures demonstrate GDPD's effectiveness for full-to-partial distillation.
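
To make the "student feature as a degraded observation" idea concrete, here is a minimal toy sketch in NumPy. It is not the paper's implementation. Everything in it is an assumption chosen for illustration: the learned diffusion prior over teacher features is replaced by an analytic Gaussian prior (so its score is known in closed form), the degradation is modeled as additive Gaussian noise, posterior sampling is done with annealed Langevin dynamics, and the student feature is treated as a free vector rather than the output of a network. The function names (`posterior_sample_targets`, `distillation_loss`, `student_update`) are hypothetical.

```python
import numpy as np


def posterior_sample_targets(student_feat, prior_mean, prior_var, obs_std,
                             n_samples=256, n_levels=8, steps_per_level=25, seed=0):
    """Draw candidate teacher features consistent with one student feature.

    Annealed Langevin dynamics on a toy model (all of it assumed):
      prior:       x ~ N(prior_mean, prior_var * I), a stand-in for a learned
                   diffusion prior over teacher features
      degradation: student_feat = x + N(0, obs_std^2 * I)
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n_samples, student_feat.shape[0]))
    for sigma in np.geomspace(1.0, 0.01, n_levels):   # noise levels, coarse to fine
        prior_prec = 1.0 / (prior_var + sigma ** 2)   # precision of the noised prior
        obs_prec = 1.0 / (obs_std ** 2 + sigma ** 2)  # heuristic noised likelihood
        step = 0.1 / (prior_prec + obs_prec)          # small step relative to curvature
        for _ in range(steps_per_level):
            # Posterior score = prior score + likelihood (guidance) score.
            score = prior_prec * (prior_mean - x) + obs_prec * (student_feat - x)
            x = x + step * score + np.sqrt(2.0 * step) * rng.normal(size=x.shape)
    return x


def distillation_loss(student_feat, targets):
    """Mean squared distance from the student feature to the sampled targets."""
    return float(np.mean(np.sum((targets - student_feat) ** 2, axis=1)))


def student_update(student_feat, targets, lr=0.25):
    """One gradient step on distillation_loss w.r.t. the student feature itself."""
    grad = 2.0 * (student_feat - targets.mean(axis=0))
    return student_feat - lr * grad


# Toy usage: a 4-dimensional student feature and a standard-normal prior.
student = np.array([1.0, -1.0, 0.5, 2.0])
targets = posterior_sample_targets(student, prior_mean=0.0, prior_var=1.0, obs_std=0.5)
updated = student_update(student, targets)
print(distillation_loss(student, targets), distillation_loss(updated, targets))
```

In this toy setting the student receives a set of sampled targets rather than a single teacher feature, which mirrors the abstract's point about supervising each student feature with a distribution of plausible full-context representations. A real system would presumably swap the Gaussian score for a trained denoiser and backpropagate the loss through the student encoder.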

