CL LG Jun 10, 2021

GroupBERT: Enhanced Transformer Architecture with Efficient Grouped Structures

Ivan Chelombiev, Daniel Justus, Douglas Orr, Anastasia Dietrich, Frithjof Gressmann, Alexandros Koliousis, Carlo Luschi

arXiv:2106.05822v10.72 citations

Originality: Incremental advance

AI Analysis

This work addresses efficiency issues in NLP models for researchers and practitioners, but it is incremental as it builds on existing Transformer architectures.

The paper tackles the high computational cost of Transformer-based language models by proposing GroupBERT, which modifies the Transformer layer with a convolutional module and grouped transformations. These changes decouple local and global interactions and reduce FLOPs and training time, and the authors report better performance and efficiency than BERT models.

Attention-based language models have become a critical component in state-of-the-art natural language processing systems. However, these models have significant computational requirements, due to long training times, dense operations and large parameter counts. In this work we demonstrate a set of modifications to the structure of a Transformer layer, producing a more efficient architecture. First, we add a convolutional module to complement the self-attention module, decoupling the learning of local and global interactions. Second, we rely on grouped transformations to reduce the computational cost of dense feed-forward layers and convolutions, while preserving the expressivity of the model. We apply the resulting architecture to language representation learning and demonstrate its superior performance compared to BERT models of different scales. We further highlight its improved efficiency, both in terms of floating-point operations (FLOPs) and time-to-train.
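
To make the cost argument for grouped transformations concrete, here is a minimal sketch that compares the weight count (equivalently, the multiply-accumulates per token) of a dense projection with that of a grouped one. The function names and the sizes used (hidden size 768, expansion to 3072, 4 groups) are illustrative assumptions, not values taken from the paper, and the sketch ignores biases and any extra mixing layers a real architecture might add.

```python
def dense_cost(d_in, d_out):
    """Weights (and multiply-accumulates per token) of a dense projection."""
    return d_in * d_out


def grouped_cost(d_in, d_out, groups):
    """Same quantity when channels are split into `groups` independent blocks.

    Each block maps d_in/groups inputs to d_out/groups outputs, so the
    weight matrix is block-diagonal with `groups` blocks.
    """
    if d_in % groups or d_out % groups:
        raise ValueError("groups must divide both d_in and d_out")
    return groups * (d_in // groups) * (d_out // groups)


# Illustrative sizes only (assumed, not from the paper).
hidden, ffn, groups = 768, 3072, 4

dense = dense_cost(hidden, ffn)
grouped = grouped_cost(hidden, ffn, groups)

print("dense weights:  ", dense)
print("grouped weights:", grouped)
print("reduction:      ", dense // grouped, "x")
```

Under these assumptions the grouped projection needs a factor of `groups` fewer weights and multiply-accumulates than the dense one, because each output channel only reads the inputs of its own group. The trade-off is that information no longer flows between groups within that layer, which is why the abstract stresses preserving expressivity.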
