CL · AI · Oct 16, 2024

Tuning Language Models by Mixture-of-Depths Ensemble

arXiv:2410.13077v1 · 1 citation · h-index: 3
Originality: Incremental advance
AI Analysis

This work addresses efficiency and performance limitations in LLM tuning, which matters to both researchers and practitioners. It is rated incremental because it builds on existing tuning methods.

The paper tackles the problem of Transformer-based LLMs overlooking predictive power in intermediate layers by introducing the Mixture-of-Depths (MoD) tuning framework, which trains late layers as ensembles and achieves consistent improvements on language modeling tasks with significantly fewer trainable parameters.

Transformer-based Large Language Models (LLMs) traditionally rely on final-layer loss for training and final-layer representations for predictions, potentially overlooking the predictive power embedded in intermediate layers. Surprisingly, we find that focusing training efforts on these intermediate layers can yield training losses comparable to those of final layers, with complementary test-time performance. We introduce a novel tuning framework, Mixture-of-Depths (MoD), which trains late layers as ensembles contributing to the final logits through learned routing weights. With an auxiliary distillation loss and additional normalization modules, we ensure that the outputs of the late layers adapt to language modeling. Our MoD framework, which can be integrated with any existing tuning method, shows consistent improvement on various language modeling tasks. Furthermore, by replacing traditional trainable modules with MoD, our approach achieves similar performance with significantly fewer trainable parameters, demonstrating the potential of leveraging predictive power from intermediate representations during training.
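The ensemble idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything below is an assumption made for illustration: the routing weights are taken to be a softmax over one learned score per late layer, the final layer is taken to be the teacher for the distillation term, and the normalization modules are omitted. The function names (`mix_late_layer_logits`, `distillation_loss`) and the toy logits are hypothetical and do not come from the paper.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def mix_late_layer_logits(layer_logits, routing_scores):
    """Combine per-layer logits using softmax-normalized routing weights.

    layer_logits: one logit vector per late layer (last entry = final layer).
    routing_scores: one score per late layer (assumed to be learned).
    """
    weights = softmax(routing_scores)
    vocab_size = len(layer_logits[0])
    mixed = [
        sum(w * logits[v] for w, logits in zip(weights, layer_logits))
        for v in range(vocab_size)
    ]
    return mixed, weights


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with strictly positive q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(layer_logits):
    """Average KL from the final-layer distribution (assumed teacher)
    to each earlier late layer's distribution (students)."""
    teacher = softmax(layer_logits[-1])
    students = layer_logits[:-1]
    return sum(kl_divergence(teacher, softmax(s)) for s in students) / len(students)


# Toy example: three late layers, vocabulary of four tokens.
layer_logits = [
    [2.0, 0.5, 0.1, -1.0],
    [1.5, 1.0, 0.0, -0.5],
    [3.0, 0.2, -0.3, -1.2],  # final layer
]
routing_scores = [0.0, 0.0, 0.0]  # equal scores give a uniform ensemble

mixed_logits, weights = mix_late_layer_logits(layer_logits, routing_scores)
aux_loss = distillation_loss(layer_logits)
```

With equal routing scores the mixed logits reduce to a plain average of the late-layer logits. In an actual training setup the scores would be trainable parameters and the auxiliary term would be added to the language modeling loss with some weighting, which this sketch leaves out.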

