CL Jun 10, 2021

Convolutions and Self-Attention: Re-interpreting Relative Positions in Pre-trained Language Models

arXiv:2106.05505v1716 citations
Originality: Incremental advance
AI Analysis

This work aims to improve position encoding in pre-trained language models. It offers an incremental improvement by integrating convolutions into self-attention.

The paper examines the relationship between convolutions and self-attention in language models. It shows that relative position embeddings are equivalent to dynamic lightweight convolutions and proposes composite attention, which, used in place of absolute position embeddings, improves BERT performance on downstream tasks.

In this paper, we detail the relationship between convolutions and self-attention in natural language tasks. We show that relative position embeddings in self-attention layers are equivalent to recently-proposed dynamic lightweight convolutions, and we consider multiple new ways of integrating convolutions into Transformer self-attention. Specifically, we propose composite attention, which unites previous relative position embedding methods under a convolutional framework. We conduct experiments by training BERT with composite attention, finding that convolutions consistently improve performance on multiple downstream tasks, replacing absolute position embeddings. To inform future work, we present results comparing lightweight convolutions, dynamic convolutions, and depthwise-separable convolutions in language model pre-training, considering multiple injection points for convolutions in self-attention layers.
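The equivalence claimed in the abstract can be illustrated with a small numerical check. The sketch below is not the authors' implementation. It assumes a single head, scalar relative-position scores `beta` indexed by the offset j - i, a window half-width `k`, and no softmax normalisation. The function names are hypothetical. Under those assumptions, attention whose scores depend only on the relative offset produces the same output as a 1-D convolution over the sequence whose kernel is shared across all channels (the "lightweight" case).

```python
import numpy as np

def relative_position_attention(x, beta, k):
    """Apply un-normalised attention whose scores depend only on j - i.

    x:    (n, d) sequence of token vectors
    beta: (2k + 1,) scores, beta[m] is the score for offset j - i = m - k
    """
    n = x.shape[0]
    scores = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if abs(j - i) <= k:
                scores[i, j] = beta[j - i + k]
    # Each output row is a weighted sum of the rows of x.
    return scores @ x

def lightweight_convolution(x, beta, k):
    """Convolve along the sequence axis with one kernel shared by all channels."""
    n, _ = x.shape
    # Zero-pad k positions on each side so every token has a full window.
    padded = np.pad(x, ((k, k), (0, 0)))
    out = np.zeros_like(x)
    for i in range(n):
        window = padded[i:i + 2 * k + 1]  # rows i-k .. i+k of x
        out[i] = beta @ window
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))
beta = rng.normal(size=5)  # k = 2
same = np.allclose(relative_position_attention(x, beta, 2),
                   lightweight_convolution(x, beta, 2))
print(same)
```

A dynamic variant would make `beta` depend on the query token at position i rather than being fixed, which is the case the abstract relates to relative position embeddings. That extension is omitted here to keep the check minimal.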

Code Implementations: 1 repo
