LG | AI | May 29, 2025

Model-Preserving Adaptive Rounding

arXiv:2505.22988v2 | 8 citations | h-index: 6
Originality: Highly original
AI Analysis

This work addresses model compression for efficient deployment, improving on prior quantization methods such as GPTQ and LDLQ.

The paper introduces YAQA, an adaptive rounding algorithm that targets the error at the network's output rather than the per-layer activation error. It reports roughly 30% lower error than GPTQ/LDLQ and state-of-the-art performance on downstream tasks, with no added inference overhead.

The goal of quantization is to produce a compressed model whose output distribution is as close to the original model's as possible. To do this tractably, most quantization algorithms minimize the immediate activation error of each layer as a proxy for the end-to-end error. However, this ignores the effect of future layers, making it a poor proxy. In this work, we introduce Yet Another Quantization Algorithm (YAQA), an adaptive rounding algorithm that directly considers the error at the network's output. YAQA introduces a series of theoretical results that culminate in the first end-to-end error bounds for quantization algorithms. First, we characterize the convergence time of adaptive rounding algorithms via the structure of their Hessian approximations. We then show that the end-to-end error can be bounded by the approximation's cosine similarity to the true Hessian. This admits a natural Kronecker-factored approximation with corresponding near-optimal Hessian sketches. YAQA is provably better than GPTQ/LDLQ and empirically reduces the error by $\approx 30\%$ over these methods. YAQA even achieves a lower error than quantization-aware training. This translates to state-of-the-art performance on downstream tasks, all while adding no inference overhead.
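The two quantities the abstract leans on, a Kronecker-factored Hessian approximation and its cosine similarity to the true Hessian, can be illustrated on a toy layer. The sketch below is not the paper's algorithm: the matrices are random stand-ins, the names `a`, `b`, `h_true`, and `cosine_similarity` are hypothetical, and plain nearest rounding takes the place of YAQA's adaptive rounding. It only shows that a factored approximation lets the quadratic error proxy be evaluated without forming the full Hessian, and how the cosine similarity is measured.

```python
import numpy as np


def cosine_similarity(h_true, h_approx):
    """Cosine similarity under the Frobenius inner product."""
    num = np.sum(h_true * h_approx)
    den = np.linalg.norm(h_true) * np.linalg.norm(h_approx)
    return float(num / den)


def random_psd(n, rng):
    """Random symmetric positive-definite matrix (toy stand-in)."""
    m = rng.standard_normal((n, n))
    return m @ m.T + n * np.eye(n)


rng = np.random.default_rng(0)
d_out, d_in = 3, 4  # toy layer shape (assumption)

# Hypothetical Kronecker factors: one per weight-matrix dimension.
a = random_psd(d_out, rng)
b = random_psd(d_in, rng)
h_kron = np.kron(a, b)

# Stand-in "true" Hessian: the factored part plus an unstructured PSD term.
h_true = h_kron + 0.1 * random_psd(d_out * d_in, rng)

cos_exact = cosine_similarity(h_kron, h_kron)
cos_noisy = cosine_similarity(h_true, h_kron)

# Nearest rounding to a 0.25 grid, used here only to produce a perturbation.
w = rng.standard_normal((d_out, d_in))
w_q = np.round(w * 4) / 4
d_w = w_q - w
delta = d_w.reshape(-1)  # row-major flattening, matching np.kron ordering

# Quadratic error proxy delta^T H delta under each Hessian.
err_full = float(delta @ h_true @ delta)
err_kron_dense = float(delta @ h_kron @ delta)
# Same value computed from the factors alone, without the dense Kronecker product.
err_kron_factored = float(np.sum(d_w * (a @ d_w @ b.T)))

print(f"cosine(H_kron, H_kron) = {cos_exact:.6f}")
print(f"cosine(H_true, H_kron) = {cos_noisy:.6f}")
print(f"proxy error, full H    = {err_full:.6f}")
print(f"proxy error, Kronecker = {err_kron_factored:.6f}")
```

The factored evaluation only touches a `d_out x d_out` and a `d_in x d_in` matrix, which is what makes this kind of approximation practical at real layer sizes, where the dense Hessian over all weights of a layer would be far too large to store.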

Code Implementations: 2 repos
