LG · Jan 14

Multi-Teacher Ensemble Distillation: A Mathematical Framework for Probability-Domain Knowledge Aggregation

arXiv:2601.09165v1 · 3 citations · h-index: 3
Originality: Incremental advance
AI Analysis

The paper provides theoretical grounding for multi-teacher distillation from diverse frontier models. The advance is incremental, since it builds on existing probability-domain distillation frameworks.

The paper addresses how to aggregate knowledge from multiple teacher models in knowledge distillation. It develops an axiomatic mathematical framework built on five core axioms and proves that aggregation within this framework reduces both stochastic variance and systematic supervisory bias, with theoretical guarantees that include Jensen-type bounds and log-loss guarantees.

Building on the probability-domain distillation framework of Sparse-KD, we develop an axiomatic, operator-theoretic framework for multi-teacher ensemble knowledge distillation. Rather than prescribing a specific aggregation formula, we define five core axioms governing valid knowledge aggregation operators, encompassing convexity, positivity, continuity, weight monotonicity, and temperature coherence. We prove the existence and non-uniqueness of operator families satisfying these axioms, establishing that multiple distinct aggregation mechanisms conform to the same foundational principles. Within this framework, we establish operator-agnostic guarantees showing that multi-teacher aggregation reduces both stochastic variance and systematic supervisory bias under heterogeneous teachers, while providing Jensen-type bounds, log-loss guarantees, and safety attenuation properties. For aggregation operators linear in teacher weights, we further establish classical ensemble variance-reduction results under standard independence assumptions, with extensions to correlated-error regimes. The framework provides theoretical grounding for multi-teacher distillation from diverse frontier models while admitting multiple valid implementation strategies.
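The abstract deliberately does not prescribe an aggregation formula. As an illustration only, the sketch below uses one simple operator that is linear in the teacher weights: a weighted arithmetic mean of the teachers' probability vectors. The function names (`normalize_weights`, `aggregate`, `log_loss`), the toy distributions, and the weights are all hypothetical and do not come from the paper. The sketch checks a Jensen-type bound that follows from the convexity of the negative logarithm: the log-loss of the mixture is no larger than the weighted average of the individual teachers' log-losses.

```python
import math

def normalize_weights(weights):
    """Rescale non-negative weights so they sum to 1 (a convex combination)."""
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return [w / total for w in weights]

def aggregate(teacher_probs, weights):
    """Weighted arithmetic mean of teacher distributions.

    Assumption: this is just one hypothetical operator chosen for illustration,
    not the operator family defined in the paper.
    """
    if len(teacher_probs) != len(weights):
        raise ValueError("need exactly one weight per teacher")
    w = normalize_weights(weights)
    num_classes = len(teacher_probs[0])
    return [
        sum(w_k * p_k[i] for w_k, p_k in zip(w, teacher_probs))
        for i in range(num_classes)
    ]

def log_loss(probs, label):
    """Negative log-probability assigned to the true label."""
    return -math.log(probs[label])

# Toy example: three teachers, three classes, true class index 0.
teachers = [
    [0.7, 0.2, 0.1],
    [0.5, 0.3, 0.2],
    [0.6, 0.1, 0.3],
]
weights = normalize_weights([0.5, 0.3, 0.2])
label = 0

mixture = aggregate(teachers, weights)
mixture_loss = log_loss(mixture, label)
avg_teacher_loss = sum(
    w_k * log_loss(p_k, label) for w_k, p_k in zip(weights, teachers)
)

# Jensen's inequality for the convex function -log guarantees this ordering.
print(f"mixture loss:              {mixture_loss:.4f}")
print(f"weighted avg teacher loss: {avg_teacher_loss:.4f}")
```

The mixture stays a valid probability distribution because it is a convex combination of distributions, and it remains strictly positive wherever at least one positively weighted teacher is positive. Other operators, for example a normalized geometric mean, would behave differently and would need their own checks.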

