CL · AI · LG · Mar 1, 2024

Merging Text Transformer Models from Different Initializations

arXiv:2403.00986v3 · 14 citations · h-index: 13 · Trans. Mach. Learn. Res.
Originality: Incremental advance
AI Analysis

This work addresses the challenge of merging separately trained Transformer models, a problem relevant to researchers in natural language processing. It is an incremental advance in that it extends existing permutation-based merging techniques to the Transformer architecture.

The paper tackles the problem of merging Transformer models trained from different initializations. It develops a method for computing permutations that preserve functional equivalence, and it finds consistently lower loss barriers than plain model averaging on masked-language modeling and language understanding tasks.

Recent work on permutation-based model merging has shown impressive low- or zero-barrier mode connectivity between models from completely different initializations. However, this line of work has not yet extended to the Transformer architecture, despite its dominant popularity in the language domain. Therefore, in this work, we investigate the extent to which separate Transformer minima learn similar features, and propose a model merging technique to investigate the relationship between these minima in the loss landscape. The specifics of the architecture, like its residual connections, multi-headed attention, and discrete, sequential input, require specific interventions in order to compute model permutations that remain within the same functional equivalence class. In merging these models with our method, we consistently find lower loss barriers between minima compared to model averaging, across models trained on a masked-language modeling task or fine-tuned on a language understanding benchmark. Our results show that the minima of these models are less sharp and isolated than previously understood, and provide a basis for future work on merging separately trained Transformer models.
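The permutation symmetry that the abstract relies on can be illustrated with a toy feed-forward block. The sketch below is not the paper's method: it ignores residual connections, multi-headed attention, and embeddings, all of which the abstract says need specific handling. All names (`ffn`, `permute_hidden`, `interpolate`) are hypothetical and chosen for illustration. It shows that reordering hidden units leaves the function unchanged, and that averaging two functionally identical models only preserves the function once their hidden units are aligned.

```python
import random

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, a) for a in v]

def ffn(W1, W2, x):
    """Toy two-layer feed-forward block: W2 @ relu(W1 @ x)."""
    return matvec(W2, relu(matvec(W1, x)))

def permute_hidden(W1, W2, perm):
    """Reorder hidden units: rows of W1 and the matching columns of W2."""
    W1p = [W1[j] for j in perm]
    W2p = [[row[j] for j in perm] for row in W2]
    return W1p, W2p

def interpolate(A, B, lam):
    """Elementwise (1 - lam) * A + lam * B for two same-shaped matrices."""
    return [[(1 - lam) * a + lam * b for a, b in zip(ra, rb)]
            for ra, rb in zip(A, B)]

def max_abs_diff(u, v):
    return max(abs(a - b) for a, b in zip(u, v))

rng = random.Random(0)
d_in, d_hidden, d_out = 4, 6, 3
W1 = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_hidden)]
W2 = [[rng.gauss(0, 1) for _ in range(d_hidden)] for _ in range(d_out)]
x = [rng.gauss(0, 1) for _ in range(d_in)]

# "Model B" is model A with its hidden units cyclically shifted.
perm = [1, 2, 3, 4, 5, 0]
W1b, W2b = permute_hidden(W1, W2, perm)

y_a = ffn(W1, W2, x)
y_b = ffn(W1b, W2b, x)
equiv_diff = max_abs_diff(y_a, y_b)  # same function, different weights

# Naive merge: average the weights without aligning hidden units.
y_naive = ffn(interpolate(W1, W1b, 0.5), interpolate(W2, W2b, 0.5), x)
naive_diff = max_abs_diff(y_a, y_naive)

# Aligned merge: undo the permutation on B first, then average.
inv = [0] * d_hidden
for i, j in enumerate(perm):
    inv[j] = i
W1b_al, W2b_al = permute_hidden(W1b, W2b, inv)
y_aligned = ffn(interpolate(W1, W1b_al, 0.5), interpolate(W2, W2b_al, 0.5), x)
aligned_diff = max_abs_diff(y_a, y_aligned)

print(equiv_diff, naive_diff, aligned_diff)
```

In this toy case the aligning permutation is known by construction. For two independently trained models it has to be estimated, which is the harder problem that permutation-based merging methods address.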

Code Implementations: 1 repo
