Optimal Scalar Quantization for Matrix Multiplication: Closed-Form Density and Phase Transition
This work addresses efficient and accurate quantized matrix multiplication, with potential applications in areas such as large language model optimization, though the contribution is incremental in that it builds on existing quantization methods.
The paper minimizes the mean-squared error of matrix multiplication by optimally quantizing the entries of the two matrices before they are multiplied. It derives a closed-form optimal quantization density for correlated Gaussian pairs and identifies a correlation-driven phase transition in which the density changes from unimodal to bimodal once |ρ| > 1/√3.
We study entrywise scalar quantization of two matrices prior to multiplication. Given $A\in \mathbb{R}^{m\times k}$ and $B\in \mathbb{R}^{k\times n}$, we quantize entries of $A$ and $B$ independently using scalar quantizers with $K_X$ and $K_Y$ levels per entry, and form $\widehat C=\widehat A\,\widehat B$. The objective is to minimize the matrix multiplication mean-squared error (MSE) $\mathcal{E}=\mathbb{E}[\|AB-\widehat A\widehat B\|_F^2]$ under a pair-i.i.d.\ inner-product model. In the high-resolution regime $K_X,K_Y\to\infty$, we derive a sharp $K^{-2}$ asymptotic expansion for $\mathcal{E}$, identify the exact optimal leading constants, and characterize asymptotically optimal quantization center densities in terms of conditional second moments. We then specialize to correlated Gaussian multiplicative pairs, obtaining a closed-form optimal point density \[ \lambda^\star(u)\ \propto\ \exp\!\left(-\frac{u^2}{6}\right)\bigl((1-\rho^2)+\rho^2u^2\bigr)^{1/3}, \qquad u=\frac{x}{\sigma_X}, \] with the same form for $y/\sigma_Y$, and prove a correlation-driven phase transition: the density is unimodal at the origin for $|\rho|\leq 1/\sqrt{3}$ and becomes bimodal for $|\rho|>1/\sqrt{3}$ with peaks at $u_{\mathrm{peak}}=\pm\sqrt{3-1/\rho^2}$. We show our method's applicability in synthetic experiments such as matrix multiplication quantization and least squares optimization, as well as in the quantization of large language model key and query activations.
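The closed-form density and the stated phase transition can be checked numerically. The sketch below is illustrative only and is not taken from the paper: the function names (`optimal_density_unnormalized`, `predicted_peaks`, `numerical_peak`) and the grid-search parameters are assumptions introduced here. It evaluates the unnormalized $\lambda^\star(u)$, returns the mode locations implied by the threshold $|\rho|=1/\sqrt{3}$, and compares them with a brute-force argmax over a grid.

```python
import math
from typing import Tuple


def optimal_density_unnormalized(u: float, rho: float) -> float:
    """Unnormalized lambda*(u) = exp(-u^2/6) * ((1 - rho^2) + rho^2 u^2)^(1/3)."""
    poly = (1.0 - rho * rho) + rho * rho * u * u
    return math.exp(-u * u / 6.0) * poly ** (1.0 / 3.0)


def predicted_peaks(rho: float) -> Tuple[float, ...]:
    """Mode locations implied by the phase transition stated in the abstract."""
    if abs(rho) <= 1.0 / math.sqrt(3.0):
        return (0.0,)  # unimodal regime: single peak at the origin
    u_peak = math.sqrt(3.0 - 1.0 / (rho * rho))
    return (-u_peak, u_peak)  # bimodal regime: symmetric pair of peaks


def numerical_peak(rho: float, u_max: float = 5.0, n: int = 50001) -> float:
    """Grid-search argmax over u in [0, u_max]; the density is even in u.

    u_max and n are arbitrary choices for this sketch, not values from the paper.
    """
    grid = [u_max * i / (n - 1) for i in range(n)]
    return max(grid, key=lambda u: optimal_density_unnormalized(u, rho))


if __name__ == "__main__":
    for rho in (0.0, 0.3, 0.7, 0.9):
        print(rho, predicted_peaks(rho), numerical_peak(rho))
```

The peak formula follows from setting the derivative of $\log\lambda^\star(u)$ to zero, which gives either $u=0$ or $\rho^2u^2=3\rho^2-1$; the second solution is real only when $\rho^2>1/3$.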