CL LG | Sep 13, 2019

Neural Machine Translation with 4-Bit Precision and Beyond

arXiv:1909.06091v2 | 0.67 citations

Originality: Incremental advance

AI Analysis

This work addresses the resource-intensive nature of NMT for deployment on limited hardware, though it is incremental as it builds on existing quantization techniques.

The authors compress neural machine translation models for resource-constrained devices with a quantization procedure that combines logarithmic quantization and error-feedback retraining. Models can be compressed down to 4-bit precision without noticeable quality degradation, and down to binary precision with lower quality.

Neural Machine Translation (NMT) is resource intensive. We design a quantization procedure to compress NMT models better for devices with limited hardware capability. Because most neural network parameters are near zero, we employ logarithmic quantization in lieu of fixed-point quantization. However, we find bias terms are less amenable to log quantization but note they comprise a tiny fraction of the model, so we leave them uncompressed. We also propose to use an error-feedback mechanism during retraining, to preserve the compressed model as a stale gradient. We empirically show that NMT models based on Transformer or RNN architecture can be compressed up to 4-bit precision without any noticeable quality degradation. Models can be compressed up to binary precision, albeit with lower quality. The RNN architecture seems to be more robust to quantization, compared to the Transformer.
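The two ideas named in the abstract, logarithmic quantization and error feedback during retraining, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names `log_quantize` and `retrain_with_error_feedback`, the choice of the largest magnitude as the scale, the use of one bit for the sign, and the toy gradient interface `grad_fn` are all assumptions made for illustration. The abstract does not specify these details.

```python
import numpy as np


def log_quantize(w, bits=4):
    """Round each weight to sign * scale * 2**(-k), with k in [0, 2**(bits-1) - 1].

    Assumption: one bit stores the sign and the rest store the exponent k.
    Assumption: the scale is the largest magnitude in the tensor.
    """
    levels = 2 ** (bits - 1)
    scale = np.max(np.abs(w))
    if scale == 0:
        return np.zeros_like(w)
    # Clamp tiny magnitudes so log2 never sees zero; they land on the smallest level.
    ratio = np.maximum(np.abs(w) / scale, 2.0 ** -(levels - 1))
    # Round in the log domain, so levels are dense near zero and sparse near the scale.
    k = np.clip(np.round(-np.log2(ratio)), 0, levels - 1)
    # Exact zeros are treated as positive so only 2**bits values are ever produced.
    sign = np.where(w < 0, -1.0, 1.0)
    return sign * scale * 2.0 ** (-k)


def retrain_with_error_feedback(w, grad_fn, steps=100, lr=0.1, bits=4):
    """Generic error-feedback loop (an assumed form, not the paper's exact procedure).

    The gradient is evaluated at the quantized weights, and whatever the
    quantizer drops at one step is added back before quantizing at the next.
    """
    w_q = log_quantize(w, bits)
    error = w - w_q                      # residual lost by the first quantization
    for _ in range(steps):
        grad = grad_fn(w_q)              # gradient at the compressed model
        full = w_q - lr * grad + error   # restore the previously dropped residual
        w_q = log_quantize(full, bits)
        error = full - w_q               # carry the new residual forward
    return w_q
```

With `bits=4` this sketch allows eight magnitudes per sign, each a power-of-two fraction of the scale, which matches the abstract's motivation that most parameters sit near zero. Following the abstract, bias terms would simply be left out of the call and kept in full precision.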
