CL · Sep 17, 2020

Towards Fully 8-bit Integer Inference for the Transformer Model

arXiv:2009.08034v2 · 67 citations
AI Analysis

This work addresses the need for efficient inference in complex models like Transformers, offering a practical solution for deployment in resource-constrained environments, though it is incremental as it builds on existing quantization techniques.

The paper tackles the latency and storage cost of deep neural networks by developing an (almost) fully 8-bit integer inference method for Transformer models. On translation and language modeling tasks, it reports performance comparable to floating-point baselines with a nearly 4x smaller memory footprint.

8-bit integer inference, as a promising direction in reducing both the latency and storage of deep neural networks, has made great progress recently. On the other hand, previous systems still rely on 32-bit floating point for certain functions in complex models (e.g., Softmax in Transformer), and make heavy use of quantization and de-quantization. In this work, we show that after a principled modification on the Transformer architecture, dubbed Integer Transformer, an (almost) fully 8-bit integer inference algorithm Scale Propagation could be derived. De-quantization is adopted when necessary, which makes the network more efficient. Our experiments on WMT16 En<->Ro, WMT14 En<->De and En->Fr translation tasks as well as the WikiText-103 language modelling task show that the fully 8-bit Transformer system achieves comparable performance with the floating point baseline but requires nearly 4x less memory footprint.
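To make the memory claim concrete, here is a minimal sketch of generic symmetric 8-bit quantization in plain Python. It is not the paper's Scale Propagation algorithm or its Integer Transformer. The function names (`quantize_int8`, `dequantize`, `int_dot`) and the sample weights are hypothetical, chosen only for illustration. The sketch shows two things: storing one-byte codes instead of 32-bit floats is where a roughly 4x saving comes from, and an integer dot product can carry its scale alongside it so that de-quantization is deferred until a real value is needed.

```python
from array import array


def quantize_int8(values):
    """Symmetric per-tensor quantization: real value ~= scale * int8 code."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest code and clamp to the symmetric range [-127, 127].
    codes = array("b", (max(-127, min(127, round(v / scale))) for v in values))
    return codes, scale


def dequantize(codes, scale):
    """Map int8 codes back to approximate real values."""
    return [c * scale for c in codes]


def int_dot(codes_a, scale_a, codes_b, scale_b):
    """Integer-only dot product; the output scale is the product of input scales."""
    acc = sum(a * b for a, b in zip(codes_a, codes_b))  # integer accumulator
    return acc, scale_a * scale_b


# Hypothetical weights stored as 32-bit floats (typecode "f").
weights = array("f", [0.5, -1.0, 0.25, 0.75, -0.125, 0.0, 1.0, -0.6])
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

fp32_bytes = len(weights) * weights.itemsize  # bytes used by float storage
int8_bytes = len(codes) * codes.itemsize      # bytes used by int8 storage
max_error = max(abs(a - b) for a, b in zip(weights, restored))

# Dot product of the weights with themselves, computed on integers only.
acc, out_scale = int_dot(codes, scale, codes, scale)
approx_dot = acc * out_scale                  # de-quantize once, at the end
exact_dot = sum(w * w for w in weights)

print("bytes fp32 / int8:", fp32_bytes, int8_bytes)
print("max round-trip error:", max_error)
print("dot product exact / approx:", exact_dot, approx_dot)
```

The round-trip error is bounded by about half of `scale`, which is why a single per-tensor scale works best when values are spread fairly evenly across the range. Per-channel scales or a clipped range are common alternatives, but this sketch does not cover them.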
