Shaochen Zhong

LG
h-index5
9papers
1,012citations
Novelty55%
AI Score49

9 Papers

41.1LGMar 17, 2023Code
Data-centric Artificial Intelligence: A Survey

Daochen Zha, Zaid Pervaiz Bhat, Kwei-Herng Lai et al.

Artificial Intelligence (AI) is making a profound impact in almost every domain. A vital enabler of its great success is the availability of abundant and high-quality data for building machine learning models. Recently, the role of data in AI has been significantly magnified, giving rise to the emerging concept of data-centric AI. The attention of researchers and practitioners has gradually shifted from advancing model design to enhancing the quality and quantity of the data. In this survey, we discuss the necessity of data-centric AI, followed by a holistic view of three general data-centric goals (training data development, inference data development, and data maintenance) and the representative methods. We also organize the existing literature from automation and collaboration perspectives, discuss the challenges, and tabulate the benchmarks for various tasks. We believe this is the first comprehensive survey that provides a global view of a spectrum of tasks across various stages of the data lifecycle. We hope it can help the readers efficiently grasp a broad picture of this field, and equip them with the techniques and further research ideas to systematically engineer data for building AI systems. A companion list of data-centric AI resources will be regularly updated on https://github.com/daochenzha/data-centric-AI

40.7CLFeb 5, 2024Code
KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

Zirui Liu, Jiayi Yuan, Hongye Jin et al.

Efficiently serving large language models (LLMs) requires batching of many requests to reduce the cost per request. Yet, with larger batch sizes and longer context lengths, the key-value (KV) cache, which stores attention keys and values to avoid re-computations, significantly increases memory demands and becomes the new bottleneck in speed and memory usage. Additionally, the loading of the KV cache causes the computational core to be idle, which limits the inference speed. A straightforward and effective solution to reduce KV cache size is quantization, which decreases the total bytes taken by KV cache. However, there is a lack of in-depth studies that explore the element distribution of KV cache to understand the hardness and limitation of KV cache quantization. To fill the gap, we conducted a comprehensive study on the element distribution in KV cache of popular LLMs. Our findings indicate that the key cache should be quantized per-channel, i.e., group elements along the channel dimension and quantize them together. In contrast, the value cache should be quantized per-token. From this analysis, we developed a tuning-free 2bit KV cache quantization algorithm named KIVI. With hardware-friendly implementation, KIVI can enable Llama, Falcon, and Mistral models to maintain almost the same quality while using $\mathbf{2.6\times}$ less peak memory (including model weight). This reduction in memory usage enables up to $\mathbf{4\times}$ larger batch size, bringing $\mathbf{2.35\times \sim 3.47\times}$ throughput on real LLM inference workload. The source code is available at https://github.com/jy-yuan/KIVI.

25.0LGApr 15, 2025Code
70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float

Tianyi Zhang, Mohsen Hariri, Shaochen Zhong et al.

Large-scale AI models, such as Large Language Models (LLMs) and Diffusion Models (DMs), have grown rapidly in size, creating significant challenges for efficient deployment on resource-constrained hardware. In this paper, we introduce Dynamic-Length Float (DFloat11), a lossless compression framework that reduces LLM and DM size by 30% while preserving outputs that are bit-for-bit identical to the original model. DFloat11 is motivated by the low entropy in the BFloat16 weight representation of LLMs, which reveals significant inefficiency in the existing storage format. By applying entropy coding, DFloat11 assigns dynamic-length encodings to weights based on frequency, achieving near information-optimal compression without any loss of precision. To facilitate efficient inference with dynamic-length encodings, we develop a custom GPU kernel for fast online decompression. Our design incorporates the following: (i) compact, hierarchical lookup tables (LUTs) that fit within GPU SRAM for efficient decoding, (ii) a two-phase GPU kernel for coordinating thread read/write positions using lightweight auxiliary variables, and (iii) transformer-block-level decompression to minimize latency. Experiments on Llama 3.3, Qwen 3, Mistral 3, FLUX.1, and others validate our hypothesis that DFloat11 achieves around 30% model size reduction while preserving bit-for-bit identical outputs. Compared to a potential alternative of offloading parts of an uncompressed model to the CPU to meet memory constraints, DFloat11 achieves 2.3--46.2x higher throughput in token generation. With a fixed GPU memory budget, DFloat11 enables 5.7--14.9x longer generation lengths than uncompressed models. Notably, our method enables lossless inference of Llama 3.1 405B, an 810GB model, on a single node equipped with 8x80GB GPUs. Our code is available at https://github.com/LeanModels/DFloat11.

11.4LGFeb 20, 2025Code
Quantize What Counts: More for Keys, Less for Values

Mohsen Hariri, Alan Luo, Weicong Chen et al.

Large Language Models (LLMs) suffer inference-time memory bottlenecks dominated by the attention Key-Value (KV) cache, which scales with model size and context length. While KV-cache quantization alleviates this cost, bit allocation between keys and values is often tuned heuristically, lacking theoretical grounding and generalizability. This paper proposes two theorems that anchor mixed-precision KV quantization in the intrinsic geometry of Transformer models. First, key projections systematically have larger spectral and Frobenius norms than value matrices, implying higher information density along the key path. Second, for any given memory budget, prioritizing precision for keys over values strictly reduces quantization error and better preserves accuracy. Empirical evaluations across various prominent LLMs and benchmarks show that key-favored allocations (e.g., 4-bit keys, 2-bit values) retain up to 98.3\% accuracy compared to uniform allocations (e.g., 4-bit for both), while conserving memory. These results transform bit allocation from ad hoc tuning into a theoretically grounded, geometry-driven design principle for efficient LLM inference. Source code is available at https://github.com/mohsenhariri/spectral-kv.

20.1CRFeb 29, 2024
LoRATK: LoRA Once, Backdoor Everywhere in the Share-and-Play Ecosystem

Hongyi Liu, Shaochen Zhong, Xintong Sun et al.

Finetuning LLMs with LoRA has gained significant popularity due to its simplicity and effectiveness. Often, users may even find pluggable, community-shared LoRAs to enhance their base models for a specific downstream task of interest; enjoying a powerful, efficient, yet customized LLM experience with negligible investment. However, this convenient share-and-play ecosystem also introduces a new attack surface, where attackers can distribute malicious LoRAs to a community eager to try out shared assets. Despite the high-risk potential, no prior art has comprehensively explored LoRA's attack surface under the downstream-enhancing share-and-play context. In this paper, we investigate how backdoors can be injected into task-enhancing LoRAs and examine the mechanisms of such infections. We find that with a simple, efficient, yet specific recipe, a backdoor LoRA can be trained once and then seamlessly merged (in a training-free fashion) with multiple task-enhancing LoRAs, retaining both its malicious backdoor and benign downstream capabilities. This allows attackers to scale the distribution of compromised LoRAs with minimal effort by leveraging the rich pool of existing shared LoRA assets. We note that such merged LoRAs are particularly infectious -- because their malicious intent is cleverly concealed behind improved downstream capabilities, creating a strong incentive for voluntary download -- and dangerous -- because under local deployment, no safety measures exist to intervene when things go wrong. Our work is among the first to study this new threat model of training-free distribution of downstream-capable-yet-backdoor-injected LoRAs, highlighting the urgent need for heightened security awareness in the LoRA ecosystem. Warning: This paper contains offensive content and involves a real-life tragedy.

9.6CLMay 25, 2025Code
100-LongBench: Are de facto Long-Context Benchmarks Literally Evaluating Long-Context Ability?

Wang Yang, Hongye Jin, Shaochen Zhong et al.

Long-context capability is considered one of the most important abilities of LLMs, as a truly long context-capable LLM enables users to effortlessly process many originally exhausting tasks -- e.g., digesting a long-form document to find answers vs. directly asking an LLM about it. However, existing real-task-based long-context evaluation benchmarks have two major shortcomings. First, benchmarks like LongBench often do not provide proper metrics to separate long-context performance from the model's baseline ability, making cross-model comparison unclear. Second, such benchmarks are usually constructed with fixed input lengths, which limits their applicability across different models and fails to reveal when a model begins to break down. To address these issues, we introduce a length-controllable long-context benchmark and a novel metric that disentangles baseline knowledge from true long-context capabilities. Experiments demonstrate the superiority of our approach in effectively evaluating LLMs.

3.8LGDec 23, 2023Code
TVE: Learning Meta-attribution for Transferable Vision Explainer

Guanchu Wang, Yu-Neng Chuang, Fan Yang et al.

Explainable machine learning significantly improves the transparency of deep neural networks. However, existing work is constrained to explaining the behavior of individual model predictions, and lacks the ability to transfer the explanation across various models and tasks. This limitation results in explaining various tasks being time- and resource-consuming. To address this problem, we introduce a Transferable Vision Explainer (TVE) that can effectively explain various vision models in downstream tasks. Specifically, the transferability of TVE is realized through a pre-training process on large-scale datasets towards learning the meta-attribution. This meta-attribution leverages the versatility of generic backbone encoders to comprehensively encode the attribution knowledge for the input instance, which enables TVE to seamlessly transfer to explain various downstream tasks, without the need for training on task-specific data. Empirical studies involve explaining three different architectures of vision models across three diverse downstream datasets. The experimental results indicate TVE is effective in explaining these tasks without the need for additional training on downstream data.

6.6LGMay 24, 2023
Editable Graph Neural Network for Node Classifications

Zirui Liu, Zhimeng Jiang, Shaochen Zhong et al.

Despite Graph Neural Networks (GNNs) have achieved prominent success in many graph-based learning problem, such as credit risk assessment in financial networks and fake news detection in social networks. However, the trained GNNs still make errors and these errors may cause serious negative impact on society. \textit{Model editing}, which corrects the model behavior on wrongly predicted target samples while leaving model predictions unchanged on unrelated samples, has garnered significant interest in the fields of computer vision and natural language processing. However, model editing for graph neural networks (GNNs) is rarely explored, despite GNNs' widespread applicability. To fill the gap, we first observe that existing model editing methods significantly deteriorate prediction accuracy (up to $50\%$ accuracy drop) in GNNs while a slight accuracy drop in multi-layer perception (MLP). The rationale behind this observation is that the node aggregation in GNNs will spread the editing effect throughout the whole graph. This propagation pushes the node representation far from its original one. Motivated by this observation, we propose \underline{E}ditable \underline{G}raph \underline{N}eural \underline{N}etworks (EGNN), a neighbor propagation-free approach to correct the model prediction on misclassified nodes. Specifically, EGNN simply stitches an MLP to the underlying GNNs, where the weights of GNNs are frozen during model editing. In this way, EGNN disables the propagation during editing while still utilizing the neighbor propagation scheme for node prediction to obtain satisfactory results. Experiments demonstrate that EGNN outperforms existing baselines in terms of effectiveness (correcting wrong predictions with lower accuracy drop), generalizability (correcting wrong predictions for other similar nodes), and efficiency (low training time and memory) on various graph datasets.

13.0LGMay 24, 2023Code
Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model

Zirui Liu, Guanchu Wang, Shaochen Zhong et al.

With the rapid growth in model size, fine-tuning the large pre-trained language model has become increasingly difficult due to its extensive memory usage. Previous works usually focus on reducing the number of trainable parameters in the network. While the model parameters do contribute to memory usage, the primary memory bottleneck during training arises from storing feature maps, also known as activations, as they are crucial for gradient calculation. Notably, neural networks are usually trained using stochastic gradient descent. We argue that in stochastic optimization, models can handle noisy gradients as long as the gradient estimator is unbiased with reasonable variance. Following this motivation, we propose a new family of unbiased estimators called WTA-CRS, for matrix production with reduced variance, which only requires storing the sub-sampled activations for calculating the gradient. Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones. By replacing the linear operation with our approximated one in transformers, we can achieve up to 2.7$\times$ peak memory reduction with almost no accuracy drop and enables up to $6.4\times$ larger batch size. Under the same hardware, WTA-CRS enables better down-streaming task performance by applying larger models and/or faster training speed with larger batch sizes.