Efficient GPT Model Pre-training using Tensor Train Matrix Representation
This work addresses the high computational and storage demands of large transformer models for AI practitioners, though it is incremental as it applies an existing compression technique to a specific architecture.
The authors aim to lower the deployment and training costs of GPT-2 by reducing its parameter count: they replace the weight matrices of fully-connected layers with Tensor Train Matrix (TTM) structures. The resulting model has up to 40% fewer parameters while keeping perplexity comparable to the original and performing similarly on downstream tasks such as language understanding and text summarization.
Large-scale transformer models have shown remarkable performance in language modelling tasks. However, such models feature billions of parameters, leading to difficulties in their deployment and prohibitive costs of training from scratch. To reduce the number of parameters in the GPT-2 architecture, we replace the matrices of fully-connected layers with the corresponding Tensor Train Matrix (TTM) structure. We also customize the forward and backward operations through the TTM-based layer for simplicity and stability of further training. The resulting GPT-2-based model stores up to 40% fewer parameters, showing perplexity comparable to the original model. On downstream tasks, including language understanding and text summarization, the model performs similarly to the original GPT-2 model. The proposed tensorized layers could be used to efficiently pre-train other Transformer models.
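To make the source of the parameter savings concrete, the sketch below counts the parameters of a dense weight matrix and of a TTM factorization of the same shape. It assumes the standard TTM layout, in which core k has shape (r_k, m_k, n_k, r_{k+1}) with boundary ranks equal to 1. The 768 x 3072 shape matches the first MLP projection of the smallest GPT-2 model, but the mode sizes and ranks are hypothetical values chosen for illustration, not the configuration used by the authors.

```python
from math import prod


def dense_params(in_dims, out_dims):
    """Parameter count of a dense matrix of shape prod(in_dims) x prod(out_dims)."""
    return prod(in_dims) * prod(out_dims)


def ttm_params(in_dims, out_dims, ranks):
    """Parameter count of a TTM factorization of the same matrix.

    Core k is assumed to have shape
    (ranks[k], in_dims[k], out_dims[k], ranks[k + 1]),
    with ranks[0] == ranks[-1] == 1.
    """
    if len(in_dims) != len(out_dims):
        raise ValueError("in_dims and out_dims must have the same length")
    if len(ranks) != len(in_dims) + 1 or ranks[0] != 1 or ranks[-1] != 1:
        raise ValueError("ranks must have length d + 1 and start and end with 1")
    return sum(
        ranks[k] * in_dims[k] * out_dims[k] * ranks[k + 1]
        for k in range(len(in_dims))
    )


# Hypothetical factorization of a 768 x 3072 fully-connected weight.
in_dims = (8, 8, 12)      # 8 * 8 * 12 = 768
out_dims = (12, 16, 16)   # 12 * 16 * 16 = 3072
ranks = (1, 64, 64, 1)    # illustrative TT-ranks, not taken from the paper

dense = dense_params(in_dims, out_dims)
ttm = ttm_params(in_dims, out_dims, ranks)
print(f"dense: {dense}, TTM: {ttm}, ratio: {ttm / dense:.2f}")
```

The per-layer ratio in this sketch is not the model-level figure quoted in the abstract: the overall reduction depends on which layers are tensorized and on the ranks chosen, since the remaining parts of the model keep their original size.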