LG, AI, CL, CV · Nov 6, 2023

GQKVA: Efficient Pre-training of Transformers by Grouping Queries, Keys, and Values

arXiv:2311.03426v2 · 9 citations · h-index: 6
Originality: Incremental advance
AI Analysis

This work addresses efficiency and scalability issues for researchers and practitioners using large transformer models, though it appears incremental as it builds on existing grouping techniques.

The paper tackles two problems in transformer-based models, slow and computationally intensive pre-training and over-parametrization, by proposing GQKVA, a method that groups queries, keys, and values to speed up pre-training and reduce model size. In image classification with ViT, one variant gained about 0.3% in accuracy while reducing model size by about 4%; the most aggressive variant reduced model size by approximately 15% with only around a 1% drop in accuracy.

Massive transformer-based models face several challenges, including slow and computationally intensive pre-training and over-parametrization. This paper addresses these challenges by proposing a versatile method called GQKVA, which generalizes query, key, and value grouping techniques. GQKVA is designed to speed up transformer pre-training while reducing the model size. Our experiments with various GQKVA variants highlight a clear trade-off between performance and model size, allowing for customized choices based on resource and time limitations. Our findings also indicate that the conventional multi-head attention approach is not always the best choice, as there are lighter and faster alternatives available. We tested our method on ViT, which achieved an approximate 0.3% increase in accuracy while reducing the model size by about 4% in the task of image classification. Additionally, our most aggressive model reduction experiment resulted in a reduction of approximately 15% in model size, with only around a 1% drop in accuracy.
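To make the size trade-off described above concrete, the following is a minimal sketch of how grouping shrinks the attention projections. It rests on assumptions that the abstract does not spell out: each of the `q_groups` query groups and `kv_groups` key/value groups gets one projection of width `head_dim`, the output projection stays at full width, and biases are ignored. The names `AttentionConfig`, `q_groups`, `kv_groups`, and `attention_params` are hypothetical, and the dimensions (768 wide, 12 heads) are illustrative rather than taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttentionConfig:
    """Hypothetical configuration for one attention layer."""
    d_model: int    # embedding width
    n_heads: int    # number of attention heads
    q_groups: int   # distinct query projections (assumed parameterization)
    kv_groups: int  # distinct key/value projections (assumed parameterization)

    def __post_init__(self):
        if self.d_model % self.n_heads != 0:
            raise ValueError("d_model must be divisible by n_heads")
        for groups in (self.q_groups, self.kv_groups):
            if not 1 <= groups <= self.n_heads:
                raise ValueError("group counts must lie in [1, n_heads]")


def attention_params(cfg: AttentionConfig) -> int:
    """Count weight parameters of the Q, K, V and output projections (no biases)."""
    head_dim = cfg.d_model // cfg.n_heads
    q = cfg.d_model * cfg.q_groups * head_dim   # one projection per query group
    k = cfg.d_model * cfg.kv_groups * head_dim  # one projection per key group
    v = cfg.d_model * cfg.kv_groups * head_dim  # values share the key grouping
    out = cfg.d_model * cfg.d_model             # output projection left unchanged
    return q + k + v + out


# Conventional multi-head attention: every head has its own Q, K, V projection.
mha = AttentionConfig(d_model=768, n_heads=12, q_groups=12, kv_groups=12)

# A grouped variant: 4 query groups and 3 key/value groups (illustrative choice).
grouped = AttentionConfig(d_model=768, n_heads=12, q_groups=4, kv_groups=3)

ratio = attention_params(grouped) / attention_params(mha)
print(f"MHA params:     {attention_params(mha)}")
print(f"Grouped params: {attention_params(grouped)}")
print(f"Grouped / MHA:  {ratio:.3f}")
```

This counts only the attention projections of a single layer. Whole-model savings are smaller, because MLP blocks, embeddings, and the classification head are unaffected by grouping.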

