CL · AI · Nov 25, 2024

When Babies Teach Babies: Can student knowledge sharing outperform Teacher-Guided Distillation on small datasets?

arXiv:2411.16487v1 · 14 citations
Originality: Incremental advance
AI Analysis

This work addresses data efficiency in language model training and reduces computational cost by removing the teacher model. It appears incremental, however: it builds on deep mutual learning and adds a new weighting strategy.

The paper tackles the problem of data-efficient language model pretraining by introducing a weighted mutual learning method that eliminates the need for a teacher model, showing that teacher-less approaches can match or surpass teacher-supervised methods in evaluations.

We present our submission to the BabyLM challenge, aiming to push the boundaries of data-efficient language model pretraining. Our method builds upon deep mutual learning, introducing a student model search for diverse initialization. We address the limitation of treating students equally by formulating weighted mutual learning as a bi-level optimization problem. The inner loop learns compact students through online distillation, while the outer loop optimizes weights for better knowledge distillation from diverse students. This dynamic weighting strategy eliminates the need for a teacher model, reducing computational requirements. Our evaluations show that teacher-less methods can match or surpass teacher-supervised approaches.
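To make the inner loop concrete, here is a minimal sketch of a weighted mutual learning loss for one training example. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `alpha` mixing coefficient, and the exact way peer weights enter the loss are all hypothetical. Each student combines its own cross-entropy with a weighted KL term that pulls it toward its peers' predictions. The outer loop, which would adjust `weights`, is not shown.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    # KL(p || q) for discrete distributions with strictly positive entries.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def weighted_mutual_loss(student_logits, label, weights, alpha=0.5):
    """Per-student inner-loop loss (assumed form, for illustration only).

    student_logits: one list of logits per student, for a single example
    label:          index of the correct class
    weights:        positive importance weight per student (set by the outer loop)
    alpha:          hypothetical trade-off between supervised and peer terms
    """
    probs = [softmax(logits) for logits in student_logits]
    losses = []
    for i, p_i in enumerate(probs):
        # Supervised term: cross-entropy against the ground-truth label.
        ce = -math.log(p_i[label])
        # Peer term: KL from each other student, weighted by normalized importance.
        peers = [j for j in range(len(probs)) if j != i]
        norm = sum(weights[j] for j in peers)
        distill = sum(weights[j] / norm * kl_divergence(probs[j], p_i) for j in peers)
        losses.append((1 - alpha) * ce + alpha * distill)
    return losses
```

When all students produce identical predictions, the peer term vanishes and each loss reduces to `(1 - alpha)` times the cross-entropy. Disagreement between students adds a positive penalty, scaled by how much weight the outer loop assigns to each peer.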

Code Implementations: 1 repo
