CL AI LG ML
Jun 18, 2018

GroupReduce: Block-Wise Low-Rank Approximation for Neural Language Model Shrinking

Patrick H. Chen, Si Si, Yang Li, Ciprian Chelba, Cho-jui Hsieh

arXiv:1806.06950v17.078 citations

Originality: Incremental advance

AI Analysis

This work addresses the challenge of deploying large language models on resource-constrained devices or in real-time applications. It offers a domain-specific compression method that is an incremental but effective advance.

The paper tackles the problem of compressing large neural language models, specifically the embedding and softmax matrices that dominate model size. It proposes GroupReduce, a block-wise low-rank approximation method that partitions the vocabulary and exploits the frequency distribution of tokens. On the One-Billion-Word dataset, the method reaches a 6.6x compression rate for these matrices, and up to 26x when combined with quantization, with very little degradation in perplexity.

Model compression is essential for serving large deep neural nets on devices with limited resources or in applications that require real-time responses. As a case study, a state-of-the-art neural language model usually consists of one or more recurrent layers sandwiched between an embedding layer that represents input tokens and a softmax layer that generates output tokens. For problems with a very large vocabulary, the embedding and softmax matrices can account for more than half of the model size. For instance, the bigLSTM model achieves state-of-the-art performance on the One-Billion-Word (OBW) dataset with a vocabulary of around 800k words; its word embedding and softmax matrices use more than 6 GB of space and are responsible for over 90% of the model parameters. In this paper, we propose GroupReduce, a novel compression method for neural language models based on vocabulary-partition (block) based low-rank matrix approximation and the inherent frequency distribution of tokens (the power-law distribution of words). The experimental results show that our method can significantly outperform traditional compression methods such as low-rank approximation and pruning. On the OBW dataset, our method achieves a 6.6 times compression rate for the embedding and softmax matrices. When combined with quantization, it achieves a 26 times compression rate, which translates to a 12.8 times compression of the entire model with very little degradation in perplexity.
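To make the storage argument concrete, the following is a minimal sketch of how a frequency-based vocabulary partition with a separate low-rank factorization per block reduces the parameter count. It is not the paper's algorithm, which the abstract does not spell out. The function names, the equal-sized blocks, the rule of giving frequent blocks a higher rank, and the toy sizes are all assumptions made for illustration.

```python
def partition_by_frequency(counts, num_blocks):
    """Split token ids into contiguous blocks after sorting by descending count.

    Assumption: equal-sized blocks; the paper may partition differently.
    """
    order = sorted(range(len(counts)), key=lambda i: -counts[i])
    n = len(order)
    return [order[k * n // num_blocks:(k + 1) * n // num_blocks]
            for k in range(num_blocks)]


def low_rank_params(rows, dim, rank):
    """Parameters needed to store a rows x dim block as U (rows x rank) @ V (rank x dim)."""
    return rank * (rows + dim)


def block_low_rank_cost(counts, dim, block_ranks):
    """Return (dense_params, compressed_params) for a vocab x dim matrix.

    block_ranks[k] is the (hypothetical) rank assigned to the k-th most
    frequent block; frequent blocks are assumed to deserve a higher rank.
    """
    blocks = partition_by_frequency(counts, len(block_ranks))
    dense = len(counts) * dim
    compressed = sum(low_rank_params(len(b), dim, r)
                     for b, r in zip(blocks, block_ranks))
    return dense, compressed


# Toy example: 1000 tokens with Zipf-like counts, 64-dimensional embeddings.
toy_counts = [1000 // (i + 1) for i in range(1000)]
dense, compressed = block_low_rank_cost(toy_counts, dim=64, block_ranks=[32, 16, 8, 4])
print(f"dense={dense} compressed={compressed} ratio={dense / compressed:.2f}")
```

Under these toy assumptions, the saving comes from spending most of the rank budget on the small set of frequent tokens, while the long tail of rare tokens is stored at a much lower rank.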

