CL · AI · LG · Mar 14, 2024

Fisher Mask Nodes for Language Model Merging

arXiv:2403.09891v3 · 84 citations · LREC
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of multi-task learning in NLP by enabling efficient model merging. The advance is incremental: it builds on existing Fisher-weighted averaging and Fisher-based pruning techniques.

The paper tackles the problem of merging multiple task-specific fine-tuned language models into a single multi-task model. It introduces a method that uses the Fisher information of mask nodes in Transformers to compute an efficient weighted average of model parameters. The authors report improvements of up to +6.5 over baselines and computational speedups of 57.4x to 321.7x across BERT-family models.

Fine-tuning pre-trained models provides significant advantages in downstream performance. The ubiquitous nature of pre-trained models such as BERT and its derivatives in natural language processing has also led to a proliferation of task-specific fine-tuned models. As these models typically perform only one task well, additional training or ensembling is required in multi-task scenarios. The growing field of model merging provides a solution, dealing with the challenge of combining multiple task-specific models into a single multi-task model. In this study, we introduce a novel model merging method for Transformers, combining insights from previous work in Fisher-weighted averaging and the use of Fisher information in model pruning. Utilizing the Fisher information of mask nodes within the Transformer architecture, we devise a computationally efficient weighted-averaging scheme. Our method exhibits a regular and significant performance increase across various models in the BERT family, outperforming full-scale Fisher-weighted averaging at a fraction of the computational cost, with baseline performance improvements of up to +6.5 and a speedup between 57.4x and 321.7x across models. Our results demonstrate the potential of our method in current multi-task learning environments and suggest its scalability and adaptability to new model architectures and learning scenarios.
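As a rough illustration of the weighted-averaging idea described in the abstract, the sketch below merges two toy parameter vectors using Fisher values as elementwise weights, i.e. merged[j] = sum_i F_i[j] * theta_i[j] / sum_i F_i[j]. It is not the authors' implementation. The function names, the fallback to a plain mean, and the step that copies one Fisher value per group onto every parameter in that group (standing in for a coarser, mask-level Fisher estimate) are all assumptions made for this example; the abstract does not specify how mask-node Fisher values are mapped to parameters.

```python
from typing import List, Sequence


def fisher_weighted_average(
    params: Sequence[Sequence[float]],
    fishers: Sequence[Sequence[float]],
    eps: float = 1e-12,
) -> List[float]:
    """Elementwise merge: sum_i F_i[j] * theta_i[j] / sum_i F_i[j]."""
    merged = []
    for j in range(len(params[0])):
        num = sum(f[j] * p[j] for p, f in zip(params, fishers))
        den = sum(f[j] for f in fishers)
        if den < eps:
            # Assumption: fall back to a plain mean when no model
            # assigns any Fisher weight to this parameter.
            merged.append(sum(p[j] for p in params) / len(params))
        else:
            merged.append(num / den)
    return merged


def broadcast_group_fisher(
    group_fisher: Sequence[float], group_sizes: Sequence[int]
) -> List[float]:
    """Hypothetical helper: copy one Fisher value per group onto each
    parameter in that group, mimicking a coarse per-unit estimate."""
    out: List[float] = []
    for value, size in zip(group_fisher, group_sizes):
        out.extend([value] * size)
    return out


# Toy example: two "models", four parameters each, in two groups of two.
theta_a = [1.0, 2.0, 3.0, 4.0]
theta_b = [3.0, 4.0, 5.0, 6.0]
fisher_a = broadcast_group_fisher([3.0, 1.0], [2, 2])
fisher_b = broadcast_group_fisher([1.0, 1.0], [2, 2])

merged = fisher_weighted_average([theta_a, theta_b], [fisher_a, fisher_b])
print(merged)  # [1.5, 2.5, 4.0, 5.0]
```

In the first group, model A carries three times the Fisher weight of model B, so the merged values sit closer to A's parameters. In the second group the weights are equal and the result is the ordinary mean. Storing one Fisher value per group instead of one per parameter is what makes a coarse estimate cheaper to compute and keep in memory.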

Code Implementations: 1 repo