LG AI · Apr 7, 2025

Achieving binary weight and activation for LLMs using Post-Training Quantization

arXiv:2504.05352v3 · 3 citations · h-index: 1 · Has Code · ACL
Originality: Incremental advance
AI Analysis

This work targets lower computational cost for LLM deployment; the advance is incremental because it builds on existing quantization methods.

The paper tackles the performance degradation that large language models (LLMs) suffer when weights and activations are quantized below 4 bits. It proposes a post-training quantization framework with a W(1+1)A(1*4) configuration that surpasses state-of-the-art baselines on W2A4 across multiple tasks.

Quantizing large language models (LLMs) to 1-bit precision significantly reduces computational costs, but existing quantization techniques suffer from noticeable performance degradation when using weight and activation precisions below 4 bits (W4A4). In this paper, we propose a post-training quantization framework with a W(1+1)A(1*4) configuration, where weights are quantized to 1 bit with an additional 1 bit for fine-grained grouping, and activations are quantized to 1 bit with a 4-fold increase in the number of channels. For weight quantization, we propose utilizing Hessian-aware fine-grained grouping along with an EM-based quantization scheme. For activation quantization, we equivalently decompose INT4-quantized activations into a 4 * INT1 format and simultaneously smooth the scaling factors based on quantization errors, which further reduces the quantization errors in activations. Our method surpasses state-of-the-art (SOTA) LLM quantization baselines on W2A4 across multiple tasks, pushing the boundaries of existing LLM quantization methods toward fully binarized models. Code is available at https://github.com/JimmyCrave/LLM-PTQ-binarization.
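The INT4 to 4 * INT1 decomposition mentioned in the abstract can be illustrated with a small sketch. It is not taken from the paper's code: it assumes unsigned 4-bit activations (0 to 15), bit planes in {0, 1}, and weights in {-1, +1}, and the helper names (`to_bit_planes`, `expand_activations`, `expand_weights`) are hypothetical. The idea shown is that each 4-bit value equals the sum of its bits weighted by powers of two, so a dot product over C INT4 channels can be rewritten exactly as a dot product over 4C binary channels, with the powers of two folded into the replicated weights.

```python
import random


def to_bit_planes(x_int4):
    """Split unsigned 4-bit integers (0..15) into 4 binary planes, LSB first."""
    return [[(v >> k) & 1 for v in x_int4] for k in range(4)]


def expand_activations(x_int4):
    """Concatenate the 4 bit planes: C INT4 channels become 4*C binary channels."""
    return [bit for plane in to_bit_planes(x_int4) for bit in plane]


def expand_weights(w):
    """Replicate the weight row 4 times, scaling copy k by 2**k (assumed layout)."""
    return [(2 ** k) * wi for k in range(4) for wi in w]


def dot(a, b):
    """Plain integer dot product."""
    return sum(ai * bi for ai, bi in zip(a, b))


random.seed(0)
x = [random.randrange(16) for _ in range(8)]     # INT4 activations (assumed unsigned)
w = [random.choice([-1, 1]) for _ in range(8)]   # binary weights in {-1, +1}

reference = dot(x, w)                                   # INT4 activation path
binary = dot(expand_activations(x), expand_weights(w))  # 4 * INT1 activation path
assert reference == binary                              # the decomposition is exact
```

In this sketch the two paths agree exactly because only integers are involved. The scaling-factor smoothing described in the abstract is not modeled here.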

