CVCC Mar 24, 2021

DNN Quantization with Attention

arXiv:2103.13322v1
2 citations
Originality: Incremental advance
AI Analysis

The paper addresses accuracy loss in quantized DNNs, which matters for applications that need reduced memory and energy use. The advance is incremental because it builds on existing quantization techniques.

The paper tackles the accuracy drop caused by low-bit quantization of DNNs. It proposes a training procedure, DNN Quantization with Attention (DQA), that uses a learnable linear combination of high-, medium-, and low-bit quantizations and converges to a low-bit quantization through temperature scheduling. The reported accuracy is almost the same as that of full-precision DNNs on benchmarks such as CIFAR10, CIFAR100, and ImageNet ILSVRC 2012.

Low-bit quantization of network weights and activations can drastically reduce the memory footprint, complexity, energy consumption and latency of Deep Neural Networks (DNNs). However, low-bit quantization can also cause a considerable drop in accuracy, in particular when we apply it to complex learning tasks or lightweight DNN architectures. In this paper, we propose a training procedure that relaxes the low-bit quantization. We call this procedure DNN Quantization with Attention (DQA). The relaxation is achieved by using a learnable linear combination of high, medium and low-bit quantizations. Our learning procedure converges step by step to a low-bit quantization using an attention mechanism with temperature scheduling. In experiments, our approach outperforms other low-bit quantization techniques on various object recognition benchmarks such as CIFAR10, CIFAR100 and ImageNet ILSVRC 2012, achieves almost the same accuracy as a full precision DNN, and considerably reduces the accuracy drop when quantizing lightweight DNN architectures.
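
The relaxation described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the symmetric uniform quantizer, the bit widths (8, 4, 2), the hand-set attention logits, and the exponential temperature decay are all assumptions chosen for illustration. In actual training the logits would be learned parameters. The sketch only shows how a softmax with a shrinking temperature turns a blend of several quantizers into a single one.

```python
import math


def quantize_uniform(w, bits, w_max=1.0):
    """Assumed symmetric uniform quantizer with 2**bits - 1 levels in [-w_max, w_max]."""
    levels = 2 ** (bits - 1) - 1          # positive levels, e.g. 1 for 2 bits
    step = w_max / levels
    clipped = max(-w_max, min(w_max, w))  # clip to the representable range
    return round(clipped / step) * step


def softmax(logits, temperature):
    """Numerically stable softmax over logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_quantize(w, logits, temperature, bit_widths=(8, 4, 2)):
    """Attention-weighted linear combination of high-, medium-, and low-bit quantizations."""
    attention = softmax(logits, temperature)
    return sum(a * quantize_uniform(w, b) for a, b in zip(attention, bit_widths))


def temperature_schedule(step, total_steps, t_start=1.0, t_end=0.01):
    """Hypothetical exponential decay from t_start to t_end over training."""
    frac = step / max(1, total_steps - 1)
    return t_start * (t_end / t_start) ** frac


if __name__ == "__main__":
    w = 0.37
    logits = [0.0, 0.2, 0.5]  # hand-set here; the low-bit branch has the largest logit
    total_steps = 10
    for step in range(total_steps):
        t = temperature_schedule(step, total_steps)
        print(f"step={step} T={t:.4f} w_q={mixed_quantize(w, logits, t):.6f}")
```

At a high temperature the output is a blend of all three quantized values. As the temperature falls, the softmax weights concentrate on the branch with the largest logit, so the blended value approaches the output of that single quantizer (here the 2-bit one).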

