LG AI CL CV | Nov 6, 2023

GQKVA: Efficient Pre-training of Transformers by Grouping Queries, Keys, and Values

Farnoosh Javadi, Walid Ahmed, Habib Hajimolahoseini, Foozhan Ataiefard, Mohammad Hassanpour, Saina Asani, Austin Wen, Omar Mohamed Awad, Kangling Liu, Yang Liu

arXiv:2311.03426v213.09 citations | h-index: 6

Originality: Incremental advance

AI Analysis

This work addresses efficiency and scalability issues for researchers and practitioners using large transformer models, though it appears incremental as it builds on existing grouping techniques.

The paper tackles two problems in transformer-based models: slow, computationally intensive pre-training and over-parametrization. It proposes GQKVA, a method that groups queries, keys, and values to speed up pre-training and reduce model size. In image classification with ViT, one variant achieved about a 0.3% increase in accuracy while reducing model size by about 4%; the most aggressive reduction cut model size by approximately 15% with only around a 1% drop in accuracy.

Massive transformer-based models face several challenges, including slow and computationally intensive pre-training and over-parametrization. This paper addresses these challenges by proposing a versatile method called GQKVA, which generalizes query, key, and value grouping techniques. GQKVA is designed to speed up transformer pre-training while reducing the model size. Our experiments with various GQKVA variants highlight a clear trade-off between performance and model size, allowing for customized choices based on resource and time limitations. Our findings also indicate that the conventional multi-head attention approach is not always the best choice, as there are lighter and faster alternatives available. We tested our method on ViT, which achieved an approximate 0.3% increase in accuracy while reducing the model size by about 4% in the task of image classification. Additionally, our most aggressive model reduction experiment resulted in a reduction of approximately 15% in model size, with only around a 1% drop in accuracy.
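As a rough illustration of what grouping queries, keys, and values can look like, the NumPy sketch below assumes that queries are projected into g_q groups and keys/values into g_kv groups, and that every query group attends over every key-value group, which yields g_q x g_kv heads. That pairing scheme, the function names, and the tensor shapes are assumptions made for this example; it is not the authors' implementation and reproduces none of the reported numbers.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def grouped_qkv_attention(x, w_q, w_k, w_v, w_o, head_dim):
    """Hypothetical grouped attention for a single sequence.

    x:   (seq, d_model)
    w_q: (d_model, g_q * head_dim)
    w_k: (d_model, g_kv * head_dim)
    w_v: (d_model, g_kv * head_dim)
    w_o: (g_q * g_kv * head_dim, d_model)
    """
    seq = x.shape[0]
    g_q = w_q.shape[1] // head_dim
    g_kv = w_k.shape[1] // head_dim

    q = (x @ w_q).reshape(seq, g_q, head_dim)
    k = (x @ w_k).reshape(seq, g_kv, head_dim)
    v = (x @ w_v).reshape(seq, g_kv, head_dim)

    # Assumption: pair every query group i with every key-value group j.
    scores = np.einsum("sid,tjd->ijst", q, k) / np.sqrt(head_dim)
    weights = softmax(scores, axis=-1)            # (g_q, g_kv, seq, seq)
    heads = np.einsum("ijst,tjd->sijd", weights, v)  # (seq, g_q, g_kv, head_dim)

    # Concatenate the g_q * g_kv heads and project back to d_model.
    return heads.reshape(seq, g_q * g_kv * head_dim) @ w_o


def qkv_param_count(d_model, head_dim, g_q, g_kv):
    # Weights in the Q, K and V projections only (no biases, no output projection).
    return d_model * head_dim * (g_q + 2 * g_kv)


# Illustrative sizes only: 4 query groups x 2 key-value groups = 8 heads.
d_model, head_dim, g_q, g_kv, seq = 64, 8, 4, 2, 5
rng = np.random.default_rng(0)
x = rng.standard_normal((seq, d_model))
w_q = rng.standard_normal((d_model, g_q * head_dim))
w_k = rng.standard_normal((d_model, g_kv * head_dim))
w_v = rng.standard_normal((d_model, g_kv * head_dim))
w_o = rng.standard_normal((g_q * g_kv * head_dim, d_model))

out = grouped_qkv_attention(x, w_q, w_k, w_v, w_o, head_dim)
grouped_params = qkv_param_count(d_model, head_dim, g_q, g_kv)
mha_params = 3 * d_model * (g_q * g_kv) * head_dim  # 8-head multi-head attention baseline
print(out.shape, grouped_params, mha_params)
```

In this toy configuration the grouped projections hold fewer weights than an 8-head multi-head attention layer of the same width, which is the kind of size reduction the abstract describes; the actual accuracy trade-off depends on the variant and has to be measured.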
