CL · Aug 3, 2023

Baby Llama: knowledge distillation from an ensemble of teachers trained on a small dataset with no performance penalty

arXiv:2308.02019v2155 citations · h-index: 26
Originality: Incremental advance
AI Analysis

This paper addresses sample efficiency in language model training, showing that a distilled student can outperform its teachers when the training set is small. The advance appears incremental because it builds on established distillation techniques.

The authors tackled the problem of improving language model sample efficiency by distilling knowledge from an ensemble of teachers (GPT-2 and small LLaMA models) trained on a 10M-word dataset into a 58M-parameter LLaMA model, which exceeded the performance of both teachers and a directly trained model.

We present our submission to the BabyLM challenge, whose goal was to improve the sample efficiency of language models. We trained an ensemble consisting of a GPT-2 and small LLaMA models on the developmentally-plausible, 10M-word BabyLM dataset, then distilled it into a small, 58M-parameter LLaMA model, which exceeds in performance both of its teachers as well as a similar model trained without distillation. This suggests that distillation can not only retain the full performance of the teacher model when the latter is trained on a sufficiently small dataset; it can exceed it, and lead to significantly better performance than direct training.
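The abstract does not spell out the training objective, so the following is only a minimal sketch of one common way to distill an ensemble into a student: a hard-label cross-entropy term is blended with a KL term that pulls the student toward the averaged, temperature-softened teacher distribution. The function names, the weight `alpha`, the `temperature` value, and the `temperature**2` scaling are illustrative assumptions, not details taken from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_distillation_loss(student_logits, teacher_logits_list, target,
                               alpha=0.5, temperature=2.0):
    """Blend hard-label cross-entropy with KL to the averaged teachers.

    alpha and temperature are assumed, illustrative hyperparameters.
    """
    # Hard-label term: cross-entropy of the student at temperature 1.
    student_probs = softmax(student_logits)
    cross_entropy = -math.log(student_probs[target])

    # Soft-label term: average the teachers' softened distributions.
    teacher_probs = [softmax(t, temperature) for t in teacher_logits_list]
    n_teachers = len(teacher_probs)
    avg_teacher = [
        sum(p[i] for p in teacher_probs) / n_teachers
        for i in range(len(student_logits))
    ]

    # KL(avg_teacher || student), both at the same temperature.
    student_soft = softmax(student_logits, temperature)
    kl = sum(
        q * math.log(q / p)
        for q, p in zip(avg_teacher, student_soft)
        if q > 0
    )

    # The temperature**2 factor is a common convention (assumed here) that
    # keeps the soft-term gradients comparable in scale to the hard term.
    return alpha * cross_entropy + (1 - alpha) * temperature ** 2 * kl


# Toy example: one token position, a three-word vocabulary, two teachers.
student = [2.0, 0.5, -1.0]
teachers = [[1.5, 1.0, -0.5], [2.5, 0.0, -1.5]]
loss = ensemble_distillation_loss(student, teachers, target=0)
```

In a real setup the same computation would run per token over the full vocabulary, with the teachers frozen and only the student updated.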

Code Implementations: 1 repo