CV AI | May 16, 2018

PACT: Parameterized Clipping Activation for Quantized Neural Networks

Jungwook Choi, Zhuo Wang, Swagath Venkataramani, Pierce I-Jen Chuang, Vijayalakshmi Srinivasan, Kailash Gopalakrishnan

arXiv:1805.06085v243.31113 citations

Originality: Highly original

AI Analysis

For AI practitioners, this paper addresses computational efficiency: it enables ultra-low-precision inference with super-linear performance improvements in hardware. The contribution is incremental in that it builds on existing quantization methods.

The paper tackles the high computation cost of deep learning by proposing PACT, a quantization scheme for activations that is applied during training. With PACT, neural networks can use 4-bit weights and activations while achieving accuracy comparable to full-precision networks across a range of models and datasets.

Deep learning algorithms achieve high classification accuracy at the expense of significant computation cost. To address this cost, a number of quantization schemes have been proposed - but most of these techniques focused on quantizing weights, which are relatively smaller in size compared to activations. This paper proposes a novel quantization scheme for activations during training - that enables neural networks to work well with ultra low precision weights and activations without any significant accuracy degradation. This technique, PArameterized Clipping acTivation (PACT), uses an activation clipping parameter $α$ that is optimized during training to find the right quantization scale. PACT allows quantizing activations to arbitrary bit precisions, while achieving much better accuracy relative to published state-of-the-art quantization schemes. We show, for the first time, that both weights and activations can be quantized to 4-bits of precision while still achieving accuracy comparable to full precision networks across a range of popular models and datasets. We also show that exploiting these reduced-precision computational units in hardware can enable a super-linear improvement in inferencing performance due to a significant reduction in the area of accelerator compute engines coupled with the ability to retain the quantized model and activation data in on-chip memories.
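The clipping-then-quantizing idea described in the abstract can be illustrated with a minimal NumPy sketch. It assumes the commonly cited PACT formulation: activations are clipped to $[0, α]$ and then rounded onto $2^k - 1$ uniform steps, with a straight-through gradient so that only inputs at or above $α$ contribute to the update of $α$. The function names `pact_quantize` and `pact_alpha_grad` are hypothetical and are not taken from the authors' code.

```python
import numpy as np


def pact_quantize(x, alpha, bits):
    """Clip activations to [0, alpha], then round onto a uniform k-bit grid.

    Assumed formulation: y = clip(x, 0, alpha),
    y_q = round(y * (2**bits - 1) / alpha) * alpha / (2**bits - 1).
    """
    levels = 2 ** bits - 1          # number of quantization steps
    clipped = np.clip(x, 0.0, alpha)
    scale = levels / alpha          # steps per unit of activation
    return np.round(clipped * scale) / scale


def pact_alpha_grad(x, alpha, upstream):
    """Gradient of the loss with respect to alpha (straight-through estimate).

    The rounding step is treated as identity, so d(y_q)/d(alpha) is 1 where
    x >= alpha and 0 elsewhere; contributions are summed over all elements.
    """
    return np.sum(upstream * (x >= alpha))


# Illustrative values only; not data from the paper.
x = np.array([-1.0, 0.5, 2.0, 7.0])
y_q = pact_quantize(x, alpha=6.0, bits=4)
g = pact_alpha_grad(x, alpha=6.0, upstream=np.ones_like(x))
```

In a training framework, $α$ would be a learnable per-layer parameter updated by the optimizer alongside the weights; the sketch above only shows the forward mapping and the gradient signal that such an update would use.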
