LGApr 6
eOptShrinkQ: Near-Lossless KV Cache Compression Through Optimal Spectral Denoising and QuantizationPei-Chun Su
We show that the key-value (KV) cache in transformer attention heads admits a natural decomposition into a low-rank \emph{shared context} component and a full-rank \emph{per-token} residual, well described by the spiked random matrix model. This observation leads to eOptShrinkQ, a two-stage compression pipeline: optimal singular value shrinkage (eOptShrink) automatically extracts the shared structure, and the residual -- which satisfies the \emph{thin shell property} with delocalized coordinates -- is quantized by TurboQuant~\citep{zandieh2025turboquant}, a recently proposed per-vector scalar quantizer with near-optimal distortion guarantees. By restoring the isotropy that scalar quantization assumes, spectral denoising eliminates the need for both outlier handling and dedicated inner product bias correction, freeing those bits for improved reconstruction. The theoretical grounding in random matrix theory provides three guarantees: automatic rank selection via the BBP phase transition, provably near-zero inner product bias on the residual, and coordinate delocalization ensuring near-optimal quantization distortion. Experimentally, we validate eOptShrinkQ on Llama-3.1-8B and Ministral-8B across three levels: per-head MSE and inner product fidelity, where eOptShrinkQ saves nearly one bit per entry over TurboQuant at equivalent quality; end-to-end on LongBench (16 tasks), where eOptShrinkQ at $\sim$2.2 bits per entry outperforms TurboQuant at 3.0 bits; and multi-needle retrieval, where eOptShrinkQ at 2.2 bits closely matches or exceeds uncompressed FP16, suggesting that spectral denoising can act as a beneficial regularizer for retrieval-intensive tasks.
SPMar 13
Data-Driven Matrix Recovery via Optimal Shrinkage and Spatially Resolved Singular Vector Denoising under High-Dimensional Separable NoisePei-Chun Su
This paper develops a spatially resolved perturbation theory for singular vectors under high-dimensional separable noise and applies it to data-driven matrix recovery. In the asymptotic regime where the matrix dimensions are proportional and significantly larger than the signal rank, we derive exact leading-order variance formulas for the singular vector perturbation projected onto any spatial patch. The variance decomposes into a spatially non-uniform component governed by the local noise covariance and a spatially uniform component governed by the global noise level. These formulas provide the foundation for the \emph{extended optimal shrinkage and wavelet shrinkage} (e$\mathcal{OWS}$) algorithm, which recovers low-rank matrices satisfying a mixed Hölder condition. The pipeline begins with optimal shrinkage of singular values, then constructs coupled multiscale partition trees on the row and column spaces from the denoised estimate, generating a tensor Haar-Walsh wavelet basis. Spatially adaptive wavelet shrinkage is applied using data-driven, coefficient-level thresholds derived from the perturbation theory. We establish convergence rates that strictly improve upon both optimal shrinkage and wavelet shrinkage applied in isolation. Numerical simulations demonstrate reliable matrix recovery and accurate reconstruction of the underlying singular subspaces, including an application to fetal ECG extraction.
NAJun 18, 2025
Intrinsic and Extrinsic Organized Attention: Softmax Invariance and Network SparsityOluwadamilola Fasina, Ruben V. C. Pohle, Pei-Chun Su et al.
We examine the intrinsic (within the attention head) and extrinsic (amongst the attention heads) structure of the self-attention mechanism in transformers. Theoretical evidence for invariance of the self-attention mechanism to softmax activation is obtained by appealing to paradifferential calculus, (and is supported by computational examples), which relies on the intrinsic organization of the attention heads. Furthermore, we use an existing methodology for hierarchical organization of tensors to examine network structure by constructing hierarchal partition trees with respect to the query, key, and head axes of network 3-tensors. Such an organization is consequential since it allows one to profitably execute common signal processing tasks on a geometry where the organized network 3-tensors exhibit regularity. We exemplify this qualitatively, by visualizing the hierarchical organization of the tree comprised of attention heads and the diffusion map embeddings, and quantitatively by investigating network sparsity with the expansion coefficients of individual attention heads and the entire network with respect to the bi and tri-haar bases (respectively) on the space of queries, keys, and heads of the network. To showcase the utility of our theoretical and methodological findings, we provide computational examples using vision and language transformers. The ramifications of these findings are two-fold: (1) a subsequent step in interpretability analysis is theoretically admitted, and can be exploited empirically for downstream interpretability tasks (2) one can use the network 3-tensor organization for empirical network applications such as model pruning (by virtue of network sparsity) and network architecture comparison.