LG · CL · ML · Feb 7, 2019

BERT and PALs: Projected Attention Layers for Efficient Adaptation in Multi-Task Learning

arXiv:1902.02671v2 · 131 citations
AI Analysis

This work targets the computational cost and parameter redundancy of fine-tuning a separate model for each task, offering NLP researchers and practitioners an incremental improvement in multi-task adaptation efficiency.

The paper tackles parameter inefficiency in multi-task learning for natural language understanding. It introduces projected attention layers (PALs), which let a single BERT model match the performance of separately fine-tuned models on the GLUE benchmark with about 7 times fewer parameters, and which achieve state-of-the-art results on the Recognizing Textual Entailment dataset.
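The "about 7 times fewer parameters" figure can be illustrated with simple arithmetic. The sketch below compares one full model per task against one shared model plus small per-task add-ons. The task count, base model size, and per-task overhead are illustrative assumptions chosen to land near that ratio, not figures stated in this summary.

```python
def parameter_ratio(num_tasks: int, base_params: float, task_params_each: float) -> float:
    """Total parameters of separate per-task models divided by
    total parameters of one shared model with per-task add-ons."""
    separate = num_tasks * base_params                     # one full copy per task
    shared = base_params + num_tasks * task_params_each    # one copy plus small add-ons
    return separate / shared


# Hypothetical setting: 8 tasks, a 110M-parameter shared model,
# and each task adding about 1.6% of the base model's parameters.
BASE = 110e6
ratio = parameter_ratio(num_tasks=8, base_params=BASE, task_params_each=0.01625 * BASE)
print(f"{ratio:.2f}x fewer parameters")  # prints "7.08x fewer parameters"
```

Under these assumptions the shared model totals 1.13 times the base size, against 8 times for separate models, so the saving grows almost linearly with the number of tasks as long as the per-task add-ons stay small.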

Multi-task learning shares information between related tasks, sometimes reducing the number of parameters required. State-of-the-art results across multiple natural language understanding tasks in the GLUE benchmark have previously used transfer from a single large task: unsupervised pre-training with BERT, where a separate BERT model was fine-tuned for each task. We explore multi-task approaches that share a single BERT model with a small number of additional task-specific parameters. Using new adaptation modules, PALs or "projected attention layers", we match the performance of separately fine-tuned models on the GLUE benchmark with roughly 7 times fewer parameters, and obtain state-of-the-art results on the Recognizing Textual Entailment dataset.
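As a rough illustration of what a "projected attention layer" might look like, the NumPy sketch below projects hidden states down to a smaller dimension, applies multi-head self-attention there, and projects back up. This is a minimal sketch under stated assumptions, not the authors' implementation: the class name, the small dimension of 204, the head count, and the initialisation scale are all hypothetical, and the abstract does not say how the module's output is combined with the shared BERT layer, so the sketch simply returns it.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class ProjectedAttention:
    """Hypothetical task-specific module: project down, self-attend, project up."""

    def __init__(self, d_model, d_small, n_heads, rng):
        assert d_small % n_heads == 0, "d_small must divide evenly across heads"
        self.n_heads = n_heads
        scale = 0.02  # assumed initialisation scale
        self.down = rng.normal(0.0, scale, (d_model, d_small))  # down-projection
        self.up = rng.normal(0.0, scale, (d_small, d_model))    # up-projection
        self.wq = rng.normal(0.0, scale, (d_small, d_small))
        self.wk = rng.normal(0.0, scale, (d_small, d_small))
        self.wv = rng.normal(0.0, scale, (d_small, d_small))

    def __call__(self, h):
        """h has shape (seq_len, d_model); the result has the same shape."""
        z = h @ self.down                      # (seq_len, d_small)
        seq_len, d_small = z.shape
        d_head = d_small // self.n_heads

        def split(x):
            # (seq_len, d_small) -> (n_heads, seq_len, d_head)
            return x.reshape(seq_len, self.n_heads, d_head).transpose(1, 0, 2)

        q, k, v = split(z @ self.wq), split(z @ self.wk), split(z @ self.wv)
        scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (n_heads, seq_len, seq_len)
        attended = softmax(scores) @ v                        # (n_heads, seq_len, d_head)
        merged = attended.transpose(1, 0, 2).reshape(seq_len, d_small)
        return merged @ self.up                # back to (seq_len, d_model)

    def num_params(self):
        return sum(w.size for w in (self.down, self.up, self.wq, self.wk, self.wv))


rng = np.random.default_rng(0)
# 768 is the BERT-base hidden size; 204 and 12 heads are illustrative choices.
pal = ProjectedAttention(d_model=768, d_small=204, n_heads=12, rng=rng)
h = rng.normal(size=(5, 768))  # stand-in for hidden states of a 5-token input
out = pal(h)
print(out.shape, pal.num_params())
```

Because attention runs in the small projected space, the module's parameter count is a small fraction of a full BERT layer, which is the kind of saving the abstract describes.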

Code Implementations: 2 repos