LG · AI · CC · Mar 2, 2023

Ternary Quantization: A Survey

arXiv:2303.01505v1 · 7 citations · h-index: 14
Originality: Synthesis-oriented
AI Analysis

The survey gives researchers and practitioners interested in model compression a comprehensive overview, but its contribution is incremental: it synthesizes existing work without introducing new methods.

This survey reviews ternary quantization methods, which compress deep neural networks to speed up inference and reduce model size, and examines their evolution and relationships in terms of projection functions and optimization techniques.

Inference time, model size, and accuracy are critical for deploying deep neural network models. Numerous research efforts have aimed to compress neural network models while achieving faster inference and higher accuracy. Pruning and quantization are the mainstream methods to this end. In model quantization, converting the individual floating-point weights of each layer to low-precision values can substantially reduce computational overhead and improve inference speed. Many quantization methods have been studied, for example vector quantization, low-bit quantization, and binary/ternary quantization. This survey focuses on ternary quantization. We review the evolution of ternary quantization and investigate the relationships among existing ternary quantization methods from the perspective of projection functions and optimization methods.
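To make the idea of a projection function concrete, here is a minimal sketch of one common threshold-based ternary projection. It is not taken from this abstract: the function names `ternarize` and `dequantize`, the sample weights, and the threshold ratio of 0.7 (a heuristic associated with Ternary Weight Networks) are assumptions chosen for illustration. The survey itself covers many other projections.

```python
from typing import List, Tuple


def ternarize(weights: List[float], threshold_ratio: float = 0.7) -> Tuple[List[int], float]:
    """Project float weights onto {-1, 0, +1} and return (codes, scale).

    Assumption: the threshold is threshold_ratio * mean(|w|), a heuristic
    used here for illustration only, not the survey's prescribed method.
    """
    if not weights:
        return [], 0.0
    mean_abs = sum(abs(w) for w in weights) / len(weights)
    delta = threshold_ratio * mean_abs
    # Weights with magnitude at or below the threshold map to 0;
    # the rest keep only their sign.
    codes = [0 if abs(w) <= delta else (1 if w > 0 else -1) for w in weights]
    # The scale is the mean magnitude of the weights that were not zeroed.
    kept = [abs(w) for w, c in zip(weights, codes) if c != 0]
    scale = sum(kept) / len(kept) if kept else 0.0
    return codes, scale


def dequantize(codes: List[int], scale: float) -> List[float]:
    """Reconstruct approximate float weights from ternary codes."""
    return [scale * c for c in codes]


# Hypothetical layer weights, for illustration only.
weights = [0.9, -0.05, 0.4, -0.8, 0.02, -0.3]
codes, scale = ternarize(weights)
approx = dequantize(codes, scale)
print(codes, scale, approx)
```

Each weight is stored as one of three codes plus a single shared scale per layer, which is where the reduction in model size comes from. The choice of threshold and scale is exactly what distinguishes one projection function from another.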

