LG | ML | Oct 9, 2023

Transformer Fusion with Optimal Transport

ETH Zurich
arXiv:2310.05719v3 | 38 citations | h-index: 11 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses the challenge of efficiently combining transformer models, enabling heterogeneous fusion and compression. It is an incremental advance in that it extends existing fusion techniques to transformers.

The paper tackles the problem of merging multiple independently trained transformer networks. It introduces a systematic fusion method that uses Optimal Transport to soft-align architectural components. On image classification and natural language tasks, the method consistently outperforms vanilla fusion and, after short finetuning, the individual parent models.

Fusion is a technique for merging multiple independently trained neural networks in order to combine their capabilities. Past attempts have been restricted to the case of fully-connected, convolutional, and residual networks. This paper presents a systematic approach for fusing two or more transformer-based networks, exploiting Optimal Transport to (soft-)align the various architectural components. We flesh out an abstraction for layer alignment that can, in principle, generalize to arbitrary architectures. We apply it to the key ingredients of Transformers, such as multi-head self-attention, layer normalization, and residual connections, and we discuss how to handle them via various ablation studies. Furthermore, our method allows the fusion of models of different sizes (heterogeneous fusion), providing a new and efficient way to compress Transformers. The proposed approach is evaluated on both image classification tasks via Vision Transformer and natural language modeling tasks using BERT. Our approach consistently outperforms vanilla fusion and, after a surprisingly short finetuning, also outperforms the individual converged parent models. In our analysis, we uncover intriguing insights about the significant role of soft alignment in the case of Transformers. Our results showcase the potential of fusing multiple Transformers, thus compounding their expertise, in the budding paradigm of model fusion and recombination. Code is available at https://github.com/graldij/transformer-fusion.
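
To make the idea of soft alignment concrete, the following is a minimal sketch for a single linear layer only. It is not the authors' implementation. It assumes a weight-based cost (squared distance between neuron weight rows), uniform neuron masses, and entropy-regularised Sinkhorn iterations; all function names (`sinkhorn`, `align_to_anchor`, `fuse`) and the `reg` parameter are hypothetical. It also ignores attention heads, layer normalization, residual connections, and the propagation of the alignment to the next layer's inputs, all of which the paper treats explicitly.

```python
import math

def sq_dist(u, v):
    """Squared Euclidean distance between two weight rows."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def sinkhorn(cost, reg=0.05, n_iters=200):
    """Entropy-regularised transport plan between uniform marginals.

    cost has shape (n, m); the returned plan has the same shape and
    its columns each sum to 1/m after the final update.
    """
    n, m = len(cost), len(cost[0])
    scale = max(max(row) for row in cost) or 1.0  # normalise costs to [0, 1]
    K = [[math.exp(-c / (reg * scale)) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        u = [(1.0 / n) / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [(1.0 / m) / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

def align_to_anchor(w_anchor, w_other, reg=0.05):
    """Softly map the neurons (rows) of w_other onto those of w_anchor.

    Assumption: neurons are compared by their incoming weights.
    The result has as many rows as w_anchor, so w_other may have a
    different width (the heterogeneous case).
    """
    cost = [[sq_dist(ro, ra) for ra in w_anchor] for ro in w_other]
    plan = sinkhorn(cost, reg)  # shape (n_other, n_anchor)
    n_anchor, n_other, d = len(w_anchor), len(w_other), len(w_other[0])
    # Barycentric projection: each anchor neuron receives a convex
    # combination of the other model's neurons.
    return [
        [n_anchor * sum(plan[i][j] * w_other[i][k] for i in range(n_other))
         for k in range(d)]
        for j in range(n_anchor)
    ]

def fuse(w_anchor, w_other, reg=0.05):
    """Average the anchor weights with the aligned weights of the other model."""
    aligned = align_to_anchor(w_anchor, w_other, reg)
    return [[0.5 * (a + b) for a, b in zip(ra, rb)]
            for ra, rb in zip(w_anchor, aligned)]

def vanilla_fuse(w_a, w_b):
    """Plain element-wise averaging without any alignment."""
    return [[0.5 * (a + b) for a, b in zip(ra, rb)] for ra, rb in zip(w_a, w_b)]

# Toy example: w_b holds the same neurons as w_a in a different order.
w_a = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
w_b = [w_a[2], w_a[0], w_a[1]]
fused = fuse(w_a, w_b)
naive = vanilla_fuse(w_a, w_b)
```

In this toy case the two layers differ only by a permutation of neurons, so aligned fusion approximately recovers `w_a`, while vanilla averaging mixes unrelated neurons. A larger `reg` spreads each neuron's mass over more counterparts (softer alignment), and a smaller `reg` approaches a hard permutation.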

Code Implementations: 1 repo
