LG · AI · May 23, 2025

LCD: Advancing Extreme Low-Bit Clustering for Large Language Models via Knowledge Distillation

arXiv:2506.12038v1 · 1 citation · h-index: 14
Originality: Incremental advance
AI Analysis

This work targets the high memory and compute cost of deploying LLMs in real-world settings and offers a practical, cost-effective solution. Its contribution is incremental, since it builds on existing quantization and distillation techniques.

The paper tackles the challenge of deploying large language models (LLMs) by proposing LCD, a method that unifies clustering-based quantization with knowledge distillation. LCD compresses weights to 2-3 bits while preserving model performance and delivers up to a 6.2x inference speedup.

Large language models (LLMs) have achieved significant progress in natural language processing but face challenges in deployment due to high memory and computational requirements. Weight quantization is a common approach to address these issues, yet achieving effective low-bit compression remains challenging. This paper presents LCD, which unifies the learning of clustering-based quantization within a knowledge distillation framework. Using carefully designed optimization techniques, LCD preserves LLM performance even at ultra-low bit widths of 2-3 bits. Additionally, LCD compresses activations through smoothing and accelerates inference with a LUT-based design. Experimental results show that LCD outperforms existing methods and delivers up to a 6.2x speedup in inference. Notably, LCD is shown to be more cost-effective, making it a practical solution for real-world applications.
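For intuition, clustering-based quantization in general replaces each weight with a short index into a small codebook of shared values, so a 2-bit code selects one of 4 centroids and decoding is a table lookup. The sketch below illustrates that idea with plain 1-D k-means in Python. It is not LCD's procedure: the distillation objective, activation smoothing, and LUT kernel are omitted, and the names `cluster_quantize` and `dequantize` are hypothetical.

```python
import random


def cluster_quantize(weights, bits, iters=25):
    """Quantize scalars with 1-D k-means (an assumption, not LCD's optimizer).

    Returns (codebook, codes): 2**bits centroids and one index per weight.
    """
    k = 2 ** bits
    ordered = sorted(weights)
    n = len(ordered)
    # Deterministic init: place centroids at evenly spaced quantiles.
    codebook = [ordered[(2 * j + 1) * n // (2 * k)] for j in range(k)]

    def assign(values):
        # Nearest-centroid index for every value.
        return [min(range(k), key=lambda j: (v - codebook[j]) ** 2) for v in values]

    for _ in range(iters):
        codes = assign(weights)
        # Move each centroid to the mean of the weights assigned to it.
        for j in range(k):
            members = [w for w, c in zip(weights, codes) if c == j]
            if members:
                codebook[j] = sum(members) / len(members)
    return codebook, assign(weights)


def dequantize(codebook, codes):
    """Decode by table lookup: each stored index selects a centroid."""
    return [codebook[c] for c in codes]


# Example: compress 256 synthetic Gaussian weights to 2 bits each.
rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(256)]
codebook, codes = cluster_quantize(weights, bits=2)
restored = dequantize(codebook, codes)
mse = sum((w - r) ** 2 for w, r in zip(weights, restored)) / len(weights)
print(f"codebook size: {len(codebook)}, reconstruction MSE: {mse:.4f}")
```

Only the indices and the small codebook need to be stored, which is where the memory saving comes from. A method such as LCD would additionally tune the centroids against the original model's outputs rather than against weight reconstruction error alone.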

