Chuanfu Xiao

NA
h-index22
4papers
15citations
Novelty56%
AI Score34

4 Papers

CLJun 24, 2025
AnTKV: Anchor Token-Aware Sub-Bit Vector Quantization for KV Cache in Large Language Models

Zeyu Li, Chuanfu Xiao, Yang Wang et al.

Quantization has emerged as an effective and lightweight solution to reduce the memory footprint of the KV cache in Large Language Models. Nevertheless, minimizing the accuracy degradation caused by ultra-low-bit KV cache quantization remains a significant challenge. While scalar quantization is constrained by 1-bit bound, vector quantization exploits intra-vector correlations and enables sub-bit regimes, making it more suitable for ultra-low-bit quantization. To further mitigate quantization-induced degradation, we reveal that the degradation is highly uneven across tokens in attention quality. To investigate this unevenness, we introduce anchor score to measure each token's sensitivity to quantization. Our analysis and experiments show that preserving a small subset (1\%) of tokens with the highest Anchor Score significantly mitigates accuracy loss under aggressive quantization. We propose AnTKV, a dual-stage framework that leverages anchor token-aware vector quantization to compress the KV cache. It combines offline token-aware centroids learning and online anchor token selection to balance compression and accuracy. To enable efficient deployment, we design an online anchor token selection kernel compatible with FlashAttention. It allows LLaMA3-8B to scale to 840K tokens on a single 80GB A100, while delivering up to $3.5\times$ higher decoding throughput over the FP16 baseline. Experiments demonstrate that AnTKV matches or surpasses prior methods at 4-bit, and significantly reduce perplexity under ultra-low-bit quantization, achieving 6.32 at 1-bit on Mistral-7B, compared to 7.25 for CQ and 15.36 for KVQuant.

NAOct 25, 2021
A rank-adaptive higher-order orthogonal iteration algorithm for truncated Tucker decomposition

Chuanfu Xiao, Chao Yang

We propose a novel rank-adaptive higher-order orthogonal iteration (HOOI) algorithm to compute the truncated Tucker decomposition of higher-order tensors with a given error tolerance, and prove that the method is locally optimal and monotonically convergent. A series of numerical experiments related to both synthetic and real-world tensors are carried out to show that the proposed rank-adaptive HOOI algorithm is advantageous in terms of both accuracy and efficiency. Some further analysis on the HOOI algorithm and the classical alternating least squares method are presented to further understand why rank adaptivity can be introduced into the HOOI algorithm and how it works.

DCOct 20, 2020
a-Tucker: Input-Adaptive and Matricization-Free Tucker Decomposition for Dense Tensors on CPUs and GPUs

Min Li, Chuanfu Xiao, Chao Yang

Tucker decomposition is one of the most popular models for analyzing and compressing large-scale tensorial data. Existing Tucker decomposition algorithms usually rely on a single solver to compute the factor matrices and core tensor, and are not flexible enough to adapt with the diversities of the input data and the hardware. Moreover, to exploit highly efficient GEMM kernels, most Tucker decomposition implementations make use of explicit matricizations, which could introduce extra costs in terms of data conversion and memory usage. In this paper, we present a-Tucker, a new framework for input-adaptive and matricization-free Tucker decomposition of dense tensors. A mode-wise flexible Tucker decomposition algorithm is proposed to enable the switch of different solvers for the factor matrices and core tensor, and a machine-learning adaptive solver selector is applied to automatically cope with the variations of both the input data and the hardware. To further improve the performance and enhance the memory efficiency, we implement a-Tucker in a fully matricization-free manner without any conversion between tensors and matrices. Experiments with a variety of synthetic and real-world tensors show that a-Tucker can substantially outperform existing works on both CPUs and GPUs.

NAApr 6, 2020
Efficient Alternating Least Squares Algorithms for Low Multilinear Rank Approximation of Tensors

Chuanfu Xiao, Chao Yang, Min Li

The low multilinear rank approximation, also known as the truncated Tucker decomposition, has been extensively utilized in many applications that involve higher-order tensors. Popular methods for low multilinear rank approximation usually rely directly on matrix SVD, therefore often suffer from the notorious intermediate data explosion issue and are not easy to parallelize, especially when the input tensor is large. In this paper, we propose a new class of truncated HOSVD algorithms based on alternating least squares (ALS) for efficiently computing the low multilinear rank approximation of tensors. The proposed ALS-based approaches are able to eliminate the redundant computations of the singular vectors of intermediate matrices and are therefore free of data explosion. Also, the new methods are more flexible with adjustable convergence tolerance and are intrinsically parallelizable on high-performance computers. Theoretical analysis reveals that the ALS iteration in the proposed algorithms is q-linear convergent with a relatively wide convergence region. Numerical experiments with large-scale tensors from both synthetic and real-world applications demonstrate that ALS-based methods can substantially reduce the total cost of the original ones and are highly scalable for parallel computing.