CL · AI · LG · Oct 18, 2022

Tiny-Attention Adapter: Contexts Are More Important Than the Number of Parameters

arXiv:2211.01979v1 · 296 citations · h-index: 60
Originality Highly original
AI Analysis

This work addresses the need for efficient adaptation of large pretrained models for downstream tasks, offering a novel approach that improves performance while minimizing computational costs.

The paper tackles parameter-efficient transfer learning for language models by introducing a tiny-attention adapter, which modifies the hidden state at each position based on the hidden states at the other positions. On the GLUE benchmark, it outperforms other parameter-efficient methods as well as full fine-tuning while updating only 0.05% of the parameters.

Adapter-tuning is a paradigm that transfers a pretrained language model to downstream tasks by adding and tuning a small number of new parameters. Previously proposed adapter architectures are all feed-forward neural networks. In this paper, we investigate the effectiveness of using tiny-attention -- i.e., attention with extremely small per-head dimensionality -- as adapters. Our tiny-attention adapter learns to modify the hidden states at each position directly conditioned on the hidden states at all the other positions, which is missed by the previously proposed adapters. Moreover, we view its multiple attention heads as a mixture of experts and propose to average their weights during deployment, which further reduces its inference computation cost. On the GLUE benchmark, our tiny-attention adapter outperforms the other parameter-efficient transfer learning methods as well as full fine-tuning while only updating 0.05% of the parameters. On the FewGLUE benchmark, its performance is comparable to that of GPT-3 and PET.
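
To make the idea in the abstract concrete, here is a minimal sketch of a tiny-attention adapter in plain Python. It is not the authors' implementation: the residual placement, the scaled dot-product form, the random initialization, and all names (`init_adapter`, `tiny_attention_adapter`, `d_head`, and so on) are assumptions chosen for illustration. The only properties taken from the abstract are that the adapter is multi-head attention with an extremely small per-head dimensionality and that the update at each position is conditioned on the hidden states at the other positions.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def init_adapter(d_model, n_heads, d_head, rng):
    """One (W_q, W_k, W_v, W_o) tuple per head; d_head is deliberately tiny."""
    return [
        (
            rand_matrix(d_model, d_head, rng),  # query projection
            rand_matrix(d_model, d_head, rng),  # key projection
            rand_matrix(d_model, d_head, rng),  # value projection
            rand_matrix(d_head, d_model, rng),  # output projection
        )
        for _ in range(n_heads)
    ]


def tiny_attention_adapter(hidden, heads):
    """Return hidden plus the summed attention update of every head.

    hidden: list of n positions, each a list of d_model floats.
    The residual connection is an assumption of this sketch.
    """
    out = [row[:] for row in hidden]
    for w_q, w_k, w_v, w_o in heads:
        d_head = len(w_q[0])
        q, k, v = matmul(hidden, w_q), matmul(hidden, w_k), matmul(hidden, w_v)
        scores = matmul(q, transpose(k))  # n x n position-to-position scores
        weights = [softmax([s / math.sqrt(d_head) for s in row]) for row in scores]
        update = matmul(matmul(weights, v), w_o)  # mix values, project back to d_model
        for i, row in enumerate(update):
            for j, value in enumerate(row):
                out[i][j] += value
    return out


rng = random.Random(0)
hidden = rand_matrix(5, 8, rng, scale=1.0)  # 5 positions, d_model = 8
heads = init_adapter(d_model=8, n_heads=4, d_head=1, rng=rng)
adapted = tiny_attention_adapter(hidden, heads)

# Each head holds 3 * (d_model * d_head) + d_head * d_model parameters.
n_params = sum(len(w) * len(w[0]) for head in heads for w in head)
print(len(adapted), len(adapted[0]), n_params)  # prints: 5 8 128
```

With `d_head=1`, each head adds only `4 * d_model` parameters, which is why the adapter stays small even with several heads. Unlike a feed-forward adapter applied to each position independently, the update at one position here changes whenever the hidden state at another position changes.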

