CL, LG · Apr 8, 2020

LadaBERT: Lightweight Adaptation of BERT through Hybrid Model Compression

arXiv:2004.04124v2 · 1013 citations
AI Analysis

This work addresses the deployment bottleneck for BERT in real-time applications by providing a more efficient compression method. The contribution is incremental, as it builds on existing techniques.

The paper tackles BERT's high memory usage and latency in online services. It proposes LadaBERT, a hybrid model compression method that combines weight pruning, matrix factorization, and knowledge distillation. The method achieves state-of-the-art accuracy on public datasets and reduces training overhead by an order of magnitude.

BERT is a cutting-edge language representation model pre-trained on a large corpus, which achieves superior performance on various natural language understanding tasks. However, a major obstacle to applying BERT in online services is that it is memory-intensive and leads to unsatisfactory latency for user requests, which makes model compression necessary. Existing solutions leverage the knowledge distillation framework to learn a smaller model that imitates the behavior of BERT. However, the training procedure of knowledge distillation is itself expensive, as it requires sufficient training data to imitate the teacher model. In this paper, we address this issue by proposing a hybrid solution named LadaBERT (Lightweight adaptation of BERT through hybrid model compression), which combines the advantages of different model compression methods, including weight pruning, matrix factorization, and knowledge distillation. LadaBERT achieves state-of-the-art accuracy on various public datasets, while the training overhead can be reduced by an order of magnitude.
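To make the three ingredients named in the abstract concrete, here is a minimal NumPy sketch of each one on its own: magnitude-based weight pruning, low-rank matrix factorization via truncated SVD, and a temperature-softened distillation loss. This is not the authors' implementation. The function names, the choice of SVD, the magnitude criterion, and the temperature value are all assumptions made for illustration, and the abstract does not say how LadaBERT schedules or combines these steps.

```python
import numpy as np


def magnitude_prune(weight, sparsity):
    """Zero the `sparsity` fraction of entries with the smallest magnitude.

    Assumption: plain magnitude pruning; entries tied with the threshold
    are zeroed as well.
    """
    k = int(round(sparsity * weight.size))
    if k == 0:
        return weight.copy()
    # k-th smallest absolute value acts as the pruning threshold.
    threshold = np.partition(np.abs(weight).ravel(), k - 1)[k - 1]
    pruned = weight.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned


def low_rank_factorize(weight, rank):
    """Approximate `weight` (m x n) as a @ b with a: m x rank, b: rank x n.

    Assumption: truncated SVD is used as the factorization.
    """
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    a = u[:, :rank] * s[:rank]  # fold singular values into the left factor
    b = vt[:rank, :]
    return a, b


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between softened teacher and student distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    log_p_student = np.log(softmax(student_logits, temperature))
    return float(-(p_teacher * log_p_student).sum(axis=-1).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(8, 6))  # stand-in for one weight matrix

    pruned = magnitude_prune(w, sparsity=0.5)
    a, b = low_rank_factorize(pruned, rank=3)
    print("nonzero entries after pruning:", int((pruned != 0).sum()))
    print("parameters in factors:", a.size + b.size, "vs original:", w.size)

    teacher = rng.normal(size=(4, 3))
    student = rng.normal(size=(4, 3))
    print("distillation loss:", distillation_loss(student, teacher))
```

In this toy setup the factors hold 8·3 + 3·6 = 42 parameters instead of 48, so the savings are small; the reduction only becomes significant when the rank is much smaller than both matrix dimensions.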

