CL Jun 10, 2021

Convolutions and Self-Attention: Re-interpreting Relative Positions in Pre-trained Language Models

Tyler A. Chang, Yifan Xu, Weijian Xu, Zhuowen Tu

arXiv:2106.05505v1 31.6716 citations Has Code

Originality Incremental advance

AI Analysis

This work, aimed at NLP researchers, targets position encoding in pre-trained language models and offers an incremental improvement by integrating convolutions into self-attention.

The paper examines the relationship between convolutions and self-attention in language models. It shows that relative position embeddings are equivalent to dynamic lightweight convolutions and proposes composite attention, which replaces absolute position embeddings and improves BERT performance on multiple downstream tasks.

In this paper, we detail the relationship between convolutions and self-attention in natural language tasks. We show that relative position embeddings in self-attention layers are equivalent to recently-proposed dynamic lightweight convolutions, and we consider multiple new ways of integrating convolutions into Transformer self-attention. Specifically, we propose composite attention, which unites previous relative position embedding methods under a convolutional framework. We conduct experiments by training BERT with composite attention, finding that convolutions consistently improve performance on multiple downstream tasks, replacing absolute position embeddings. To inform future work, we present results comparing lightweight convolutions, dynamic convolutions, and depthwise-separable convolutions in language model pre-training, considering multiple injection points for convolutions in self-attention layers.
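The link between relative positions and convolutions can be illustrated with a minimal sketch. It is not the paper's composite attention: it assumes the simplest case, where attention weights depend only on the offset j - i, with a single feature channel and no softmax normalization. The function names, the `radius` parameter, and the `beta` weight table are hypothetical and chosen for illustration. Under those assumptions, attention with relative-position weights computes the same output as a zero-padded 1D convolution whose kernel is the weight table.

```python
import random


def relative_position_attention(values, beta, radius):
    """Attention whose weights depend only on the offset j - i.

    values: list of n floats (one feature channel, for simplicity)
    beta:   list of 2*radius + 1 weights; beta[k + radius] is the weight
            assigned to offset k (hypothetical layout for this sketch)
    """
    n = len(values)
    # Full n x n attention matrix: A[i][j] = beta[j - i] inside the window,
    # and 0 for positions further away than `radius`.
    attn = [
        [beta[j - i + radius] if abs(j - i) <= radius else 0.0 for j in range(n)]
        for i in range(n)
    ]
    return [sum(attn[i][j] * values[j] for j in range(n)) for i in range(n)]


def conv1d_same(values, kernel, radius):
    """Zero-padded 1D convolution (cross-correlation form), same output length."""
    n = len(values)
    padded = [0.0] * radius + list(values) + [0.0] * radius
    return [
        sum(kernel[k] * padded[i + k] for k in range(len(kernel)))
        for i in range(n)
    ]


# Usage: both views produce the same sequence for random inputs.
random.seed(0)
radius = 2
values = [random.random() for _ in range(8)]
beta = [random.random() for _ in range(2 * radius + 1)]

attn_out = relative_position_attention(values, beta, radius)
conv_out = conv1d_same(values, beta, radius)
max_diff = max(abs(a - c) for a, c in zip(attn_out, conv_out))
print("max difference:", max_diff)
```

In a dynamic variant, the weight table would be generated per position from the query token rather than shared across positions; that case is left out of the sketch to keep it short.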
