LG AI · Dec 16, 2025

Arithmetic-Intensity-Aware Quantization

arXiv:2512.14090v2

Originality: Incremental advance

AI Analysis

This work addresses the inference-throughput bottleneck of memory-bound neural networks and offers a practical solution for deployment, though the contribution is incremental because it builds on existing quantization methods.

The paper tackles memory-bound neural network inference by proposing Arithmetic-Intensity-Aware Quantization (AIQ), a mixed-precision quantization framework that increases arithmetic intensity by ~50% over an FP32 baseline on ResNet-20/CIFAR-10 and achieves 1.66x higher throughput than FP32 on MobileNetV2, while keeping test accuracy within ~1 percentage point.
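Arithmetic intensity is conventionally defined as operations performed per byte of memory traffic, and the standard roofline model relates it to throughput. The summary does not state the paper's exact definitions, so the notation below ($P_{\text{peak}}$ for peak compute, $B_{\text{DRAM}}$ for DRAM bandwidth) is an assumption. If quantization leaves the operation count unchanged, a ~50% AI increase means DRAM traffic falls to roughly two-thirds of the FP32 baseline. In the memory-bound regime ($B_{\text{DRAM}} \cdot \mathrm{AI} < P_{\text{peak}}$), attainable throughput grows linearly with AI.

```latex
\mathrm{AI} = \frac{\text{operations performed}}{\text{bytes transferred to and from DRAM}},
\qquad
\text{attainable throughput} = \min\bigl(P_{\text{peak}},\; B_{\text{DRAM}} \cdot \mathrm{AI}\bigr)
\frac{\mathrm{AI}_{\text{AIQ}}}{\mathrm{AI}_{\text{FP32}}}
  = \frac{\text{bytes}_{\text{FP32}}}{\text{bytes}_{\text{AIQ}}} \approx 1.5
\;\Longrightarrow\;
\text{bytes}_{\text{AIQ}} \approx \tfrac{2}{3}\,\text{bytes}_{\text{FP32}}
```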

As modern neural networks become increasingly memory-bound, inference throughput is limited by DRAM bandwidth rather than compute. We present Arithmetic-Intensity-Aware Quantization (AIQ), a mixed-precision quantization framework that chooses per-layer bit-widths to maximize arithmetic intensity (AI) while minimizing accuracy loss. AIQ is a post-training quantization method that uses search algorithms over per-layer quantization schemes to minimize a weighted loss combining AI and accuracy. On ResNet-20/CIFAR-10, AIQ increases AI by ~50% over an FP32 baseline while keeping test accuracy within ~1 percentage point, and it outperforms global uniform quantization schemes. On a memory-bound MobileNetV2 architecture, AIQ configurations achieve 1.66x higher throughput than the FP32 baseline while keeping test accuracy within 1 percentage point. We also find that AIQ naturally quantizes larger layers more aggressively.
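The abstract describes a search over per-layer bit-widths that trades AI against accuracy. Below is a minimal sketch of that idea under stated assumptions, not the authors' implementation. The cost model (`Layer` fields, one DRAM pass per weight and activation, 32-bit activations), the loss `alpha * accuracy_drop - AI_gain`, and the exhaustive search are all illustrative choices. `accuracy_drop_proxy` is a hypothetical stand-in for the measured test-accuracy drop that a post-training method would obtain on real data.

```python
from dataclasses import dataclass
from itertools import product
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Layer:
    """Hypothetical per-layer cost model, for illustration only."""
    name: str
    macs: int           # multiply-accumulates per inference
    weights: int        # number of weight parameters read from DRAM
    activations: int    # activation elements read/written to DRAM
    sensitivity: float  # hypothetical accuracy sensitivity to weight quantization


def arithmetic_intensity(layers: Sequence[Layer], weight_bits: Sequence[int],
                         act_bits: int = 32) -> float:
    """Operations (2 per MAC) divided by bytes of DRAM traffic, assuming
    every weight and activation element crosses DRAM exactly once."""
    ops = sum(2 * layer.macs for layer in layers)
    bytes_moved = sum(
        layer.weights * bits / 8 + layer.activations * act_bits / 8
        for layer, bits in zip(layers, weight_bits)
    )
    return ops / bytes_moved


def accuracy_drop_proxy(layers: Sequence[Layer], weight_bits: Sequence[int]) -> float:
    """Hypothetical stand-in for a measured accuracy drop: quantization-noise
    variance scales roughly as 2**(-2b), weighted by per-layer sensitivity."""
    return sum(layer.sensitivity * 4.0 ** (-bits)
               for layer, bits in zip(layers, weight_bits))


def exhaustive_search(
    layers: Sequence[Layer],
    choices: Sequence[int] = (4, 8, 16, 32),
    alpha: float = 1.0,
) -> Tuple[float, Tuple[int, ...], float]:
    """Try every bit-width assignment and return (loss, per-layer bits,
    AI gain over FP32) minimizing alpha * accuracy_drop - AI_gain."""
    baseline_ai = arithmetic_intensity(layers, [32] * len(layers))
    best: Optional[Tuple[float, Tuple[int, ...], float]] = None
    for bits in product(choices, repeat=len(layers)):
        ai_gain = arithmetic_intensity(layers, bits) / baseline_ai
        loss = alpha * accuracy_drop_proxy(layers, bits) - ai_gain
        if best is None or loss < best[0]:
            best = (loss, bits, ai_gain)
    assert best is not None, "choices must be non-empty"
    return best


# Two layers with equal sensitivity; the first has 100x more weights.
example = [
    Layer("large", macs=1_000_000, weights=100_000, activations=10_000, sensitivity=5.0),
    Layer("small", macs=1_000_000, weights=1_000, activations=10_000, sensitivity=5.0),
]
loss, bits, ai_gain = exhaustive_search(example)
print(bits, round(ai_gain, 2))  # prints: (4, 8) 3.69
```

In this toy setting, the search gives the large layer 4 bits and the small layer 8 bits. Lowering the large layer's precision removes far more DRAM traffic for the same proxy accuracy cost, which loosely mirrors the abstract's observation that larger layers are quantized more aggressively. The numbers come from the made-up cost model, not from the paper. Exhaustive search grows exponentially with layer count, so a real network would need a cheaper search strategy.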
