CL · May 21, 2023

Infor-Coef: Information Bottleneck-based Dynamic Token Downsampling for Compact and Efficient language model

arXiv:2305.12458v1 · 3 citations

Originality: Incremental advance

AI Analysis

This work addresses the problem of high latency and computational costs in large language models for NLP applications, offering a promising compression and acceleration approach.

The paper tackles the computational inefficiency of Transformer-based language models by proposing Infor-Coef, a method combining dynamic token downsampling and static pruning optimized with information bottleneck loss, achieving an 18x FLOPs speedup with less than 8% accuracy degradation compared to BERT.

The prevalence of Transformer-based pre-trained language models (PLMs) has led to their wide adoption for various natural language processing tasks. However, their excessive overhead leads to high latency and computational costs. Static compression methods allocate a fixed amount of computation to every sample, resulting in redundant computation. Dynamic token pruning methods selectively shorten the sequences but are unable to change the model size and rarely achieve the speedups of static pruning. In this paper, we propose a model acceleration approach for large language models that incorporates dynamic token downsampling and static pruning, optimized by the information bottleneck loss. Our model, Infor-Coef, achieves an 18x FLOPs speedup with an accuracy degradation of less than 8% compared to BERT. This work provides a promising approach to compress and accelerate Transformer-based models for NLP tasks.
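To make the abstract's argument concrete, the following is a rough back-of-the-envelope sketch of why shortening sequences (dynamic token downsampling) and shrinking layers (static pruning) reduce compute in complementary ways. It is not the paper's method or its FLOPs accounting: the cost formula is a common approximation for one encoder layer, and the function names, per-layer keep ratios, and pruned feed-forward width are all hypothetical values chosen for illustration.

```python
def layer_macs(seq_len, hidden=768, ffn=3072):
    """Approximate multiply-accumulates for one Transformer encoder layer."""
    projections = 4 * seq_len * hidden * hidden   # Q, K, V and output projections
    attention = 2 * seq_len * seq_len * hidden    # QK^T scores plus weighted sum over V
    feed_forward = 2 * seq_len * hidden * ffn     # the two dense layers of the FFN
    return projections + attention + feed_forward


def model_macs(seq_lens, hidden=768, ffn=3072):
    """Total cost, given the sequence length that enters each layer."""
    return sum(layer_macs(n, hidden, ffn) for n in seq_lens)


def downsampled_lengths(seq_len, keep_ratios):
    """Length entering each layer when layer i keeps keep_ratios[i] of its tokens."""
    lengths = []
    current = seq_len
    for ratio in keep_ratios:
        lengths.append(current)
        current = max(1, int(current * ratio))
    return lengths


# Baseline: a BERT-base-like shape, 12 layers, every layer sees all 128 tokens.
baseline = model_macs([128] * 12)

# Hypothetical compression settings (assumptions, not values from the paper):
# some layers drop 30% of their tokens, and the FFN width is pruned to 768.
keep_ratios = [1.0, 1.0, 0.7, 1.0, 0.7, 1.0, 0.7, 1.0, 0.7, 1.0, 1.0, 1.0]
lengths = downsampled_lengths(128, keep_ratios)
compressed = model_macs(lengths, ffn=768)

speedup = baseline / compressed
print(f"sequence lengths per layer: {lengths}")
print(f"estimated speedup: {speedup:.2f}x")
```

Under this approximation, token downsampling only reduces the `seq_len` factors, while static pruning only reduces the width factors, which is why the abstract treats the two as complementary rather than interchangeable.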
