LG · AI · May 29, 2025

Model-Preserving Adaptive Rounding

Albert Tseng, Zhaofeng Sun, Christopher De Sa

arXiv:2505.22988v2 · 8 citations · h-index: 6

Originality: Highly original

AI Analysis

This work addresses model compression for efficient deployment of AI systems and reports lower quantization error than prior techniques such as GPTQ/LDLQ.

The paper introduces YAQA, an adaptive rounding algorithm for quantization that directly considers the error at the network's output rather than per-layer activation error. It reports an error reduction of roughly 30% relative to GPTQ/LDLQ and state-of-the-art performance on downstream tasks, with no added inference overhead.

The goal of quantization is to produce a compressed model whose output distribution is as close to the original model's as possible. To do this tractably, most quantization algorithms minimize the immediate activation error of each layer as a proxy for the end-to-end error. However, this ignores the effect of future layers, making it a poor proxy. In this work, we introduce Yet Another Quantization Algorithm (YAQA), an adaptive rounding algorithm that directly considers the error at the network's output. YAQA introduces a series of theoretical results that culminate in the first end-to-end error bounds for quantization algorithms. First, we characterize the convergence time of adaptive rounding algorithms via the structure of their Hessian approximations. We then show that the end-to-end error can be bounded by the approximation's cosine similarity to the true Hessian. This admits a natural Kronecker-factored approximation with corresponding near-optimal Hessian sketches. YAQA is provably better than GPTQ/LDLQ and empirically reduces the error by $\approx 30\%$ over these methods. YAQA even achieves a lower error than quantization-aware training. This translates to state-of-the-art performance on downstream tasks, all while adding no inference overhead.
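To make the cosine-similarity idea in the abstract concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the toy "true" Hessian is exactly Kronecker-structured by construction (a real network's Hessian is not), all names (`h_out`, `h_in`, `cosine_similarity`) are hypothetical, and modelling a layer-wise proxy as `I ⊗ h_in` is an assumption of this sketch rather than a statement from the abstract.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two matrices under the Frobenius inner product."""
    return float(np.sum(a * b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def random_psd(n, rng):
    """Random symmetric positive semidefinite matrix (toy data only)."""
    m = rng.standard_normal((n, n))
    return m @ m.T / n


rng = np.random.default_rng(0)
out_dim, in_dim = 4, 6

# Toy Hessian for one linear layer, Kronecker-structured by construction.
# ASSUMPTION: chosen for illustration; not derived from any real model.
h_out = random_psd(out_dim, rng)  # hypothetical output-side factor
h_in = random_psd(in_dim, rng)    # hypothetical input-side factor
true_hessian = np.kron(h_out, h_in)

# Approximation that keeps both Kronecker factors.
two_sided_approx = np.kron(h_out, h_in)

# Approximation that keeps only the input-side factor.
# ASSUMPTION: used here as a stand-in for a purely layer-wise proxy.
input_only_approx = np.kron(np.eye(out_dim), h_in)

cos_two_sided = cosine_similarity(true_hessian, two_sided_approx)
cos_input_only = cosine_similarity(true_hessian, input_only_approx)

print(f"two-sided  cosine similarity: {cos_two_sided:.4f}")
print(f"input-only cosine similarity: {cos_input_only:.4f}")
```

In this toy setting the two-sided approximation matches the Hessian exactly, so its cosine similarity is 1, while the input-only approximation scores strictly lower whenever `h_out` is not a multiple of the identity. Under the abstract's claim that end-to-end error is bounded via this cosine similarity, a higher score corresponds to a tighter bound.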
