Method Drift · LLM quantization

Superseded baseline · #8 of 80 most-superseded

AQLM

LLM quantization

superseded — cited as a baseline and beaten by newer methods

3 papers critique it · 5 beat it on benchmarks

What papers say

Verbatim critique sentences, each from a paper that cites AQLM as a baseline.

  • Each codebook is 1MiB. During inference, weights are read from these codebooks in an essentially random access pattern, meaning that the entire codebook must fit in L1 cache to enable fast inference (even L2 cache is too slow). However, 1MiB is larger than any current GPU's L1 cache (the H100 has 256KB), so AQLM inference suffers from high cache miss rates and is actually slower than FP16 on modern GPUs.
    QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks
  • AQLM proposes learning free-form VQs for different groups, which allows for more flexible quantization. However, this approach has the drawback that decoding requires the lookup operation, which is computationally more expensive than QUIP# and other existing PTQ methods.
    Learning Grouped Lattice Vector Quantizers for Low-Bit LLM Compression
  • Among PTQ methods, the vector quantization method AQLM effectively mitigates some of the quantization loss, achieving 64.1 points, but it still falls 10.5 points short of full precision. The best quantization-aware training method, EfficientQAT, still suffers a 9.1-point decline in average accuracy.
    ParetoQ: Improving Scaling Laws in Extremely Low-bit LLM Quantization
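
The cache argument in the first critique and the lookup cost in the second can be illustrated with a rough sketch. It is not taken from any of the cited papers. The configuration (16-bit codes, groups of 8 weights, FP16 entries) is an assumption chosen because it reproduces the quoted 1MiB figure, and the names codebook_bytes and decode_row are hypothetical. The decode uses a single toy codebook, so it shows the lookup pattern only, not the full AQLM scheme.

```python
import random


def codebook_bytes(code_bits, group_size, bytes_per_value=2):
    """Bytes in one codebook: 2**code_bits vectors of group_size values each."""
    return (2 ** code_bits) * group_size * bytes_per_value


def decode_row(codes, codebook):
    """Rebuild one weight row by looking up every code in the codebook."""
    row = []
    for code in codes:
        # Each code selects an arbitrary codebook entry, so consecutive
        # reads jump around the table rather than streaming through it.
        row.extend(codebook[code])
    return row


# Assumed configuration (not stated in the quotes): 16-bit codes,
# groups of 8 weights, 2-byte FP16 entries.
size = codebook_bytes(code_bits=16, group_size=8)
L1_BYTES = 256 * 1024  # the H100 L1 figure quoted in the critique
ratio = size / L1_BYTES

# Toy decode with a 16-entry codebook to keep the example small.
rng = random.Random(0)
toy_codebook = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(2 ** 4)]
toy_codes = [rng.randrange(2 ** 4) for _ in range(32)]
row = decode_row(toy_codes, toy_codebook)

print(f"codebook size: {size} bytes ({ratio:.0f}x the quoted L1 size)")
print(f"decoded row length: {len(row)} weights from {len(toy_codes)} codes")
```

Under these assumptions the codebook is four times the quoted L1 capacity, which is the mismatch the QuIP# authors point to. Smaller codes or shorter groups shrink the table, at the cost of a coarser codebook.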

Beaten on benchmarks

Head-to-head results where a newer method reports beating AQLM. Values are copied from the source paper's tables — verify against the cited paper.

Newer alternatives

Recent methods in the same sub-problem, not yet superseded in the knowledge base.