BitNet
BitNet: Scaling 1-bit Transformers for Large Language Models
LLM quantization · first seen Oct 17, 2023
superseded — cited as a baseline and beaten by newer methods
5 papers critique it · 1 beat it on benchmarks
What papers say
Verbatim critique sentences, each from a paper that cites BitNet as a baseline.
“Demonstrates feasibility of extreme quantization but targets different domain (LLMs vs CNNs) and hardware (data center GPUs vs commodity CPUs).”
Source: True 4-Bit Quantized Convolutional Neural Network Training on CPU: Achieving Full-Precision Parity
“BitNet has demonstrated the potential of ternary weight representations, yet requires as many as 2T tokens to establish a stable low-bit model.”
Source: Bit-by-Bit: Progressive QAT Strategy with Outlier Channel Splitting for Stable Low-Bit LLMs
“However, the prolonged training duration and inherently limited scalability significantly constrain their practical deployment.”
Source: CCQ: Convolutional Code for Extreme Low-bit Quantization in LLMs
“this framework typically demands pre-training from scratch to ensure convergence, incurring prohibitive computational costs that hinder widespread adoption”
Source: HESTIA: A Hessian-Guided Differentiable Quantization-Aware Training Framework for Extremely Low-Bit LLMs
“BitNet a4.8 addresses this issue by using resource-intensive quantization-aware training (QAT) to achieve 1-bit weights with 4-bit activations.”
Source: BWLA: Breaking the Barrier of W1AX Post-Training Quantization for LLMs
Beaten on benchmarks
Head-to-head results where a newer method reports beating BitNet. Values are copied from the source paper's tables; verify them against the cited paper.
- RobuQ: Pushing DiTs to W1.58A2 via Robust Activation Quantization
RobuQ (w/o AMP) beats BitNet · FID [ImageNet steps=50 cfg=1.5]
17.97 (RobuQ w/o AMP) vs 41.59 (BitNet), lower is better
- RobuQ: Pushing DiTs to W1.58A2 via Robust Activation Quantization
RobuQ (w/o AMP) beats BitNet · IS [ImageNet steps=50 cfg=1.5]
103.24 (RobuQ w/o AMP) vs 44.32 (BitNet), higher is better
- RobuQ: Pushing DiTs to W1.58A2 via Robust Activation Quantization
RobuQ beats BitNet · FID [ImageNet steps=50 cfg=1.5 W1.58A2]
30.30 (RobuQ) vs 41.59 (BitNet), lower is better
- RobuQ: Pushing DiTs to W1.58A2 via Robust Activation Quantization
RobuQ (w/o AMP) beats BitNet · FID [FFHQ steps=50 Uncondition W1.58A4]
25.62 (RobuQ w/o AMP) vs 66.55 (BitNet), lower is better
Newer alternatives
Recent methods in the same sub-problem, not yet superseded in the knowledge base.
- STaR-Quant · STaR-Quant: State-Time Consistent Post-Training Quantization for Diffusion Large Language Models · Jun 3, 2026
- May 26, 2026
- May 1, 2026
- Bit-by-Bit · Bit-by-Bit: Progressive QAT Strategy with Outlier Channel Splitting for Stable Low-Bit LLMs · Apr 9, 2026
- Benford-Quant · Benford's Law as a Distributional Prior for Post-Training Quantization of Large Language Models · Jan 29, 2026
- Hestia · HESTIA: A Hessian-Guided Differentiable Quantization-Aware Training Framework for Extremely Low-Bit LLMs · Jan 28, 2026
- Layer-Wise High-Impact Parameter Ratio Optimization · Layer-Wise High-Impact Parameter Ratio Optimization in Post-Training Quantization for Large Language Models · Nov 21, 2025
- Sep 28, 2025