Is LLM.int8() superseded?

LLM.int8() (LLM quantization): superseded, meaning it is cited as a baseline and beaten by newer methods. 3 papers critique it and 0 beat it on benchmarks; it ranks #29 of 80 most-superseded. Sub-problem: cluster led by FlexRound.

Superseded baseline · #29 of 80 most-superseded

LLM.int8()

LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

LLM quantization · first seen Aug 15, 2022

What papers say

Verbatim critique sentences, each from a paper that cites LLM.int8() as a baseline.

“However, this implementation results in significant latency overhead, sometimes even slower than FP16 inference.”
— Post Training Quantization of Large Language Models with Microscaling Formats
“the inference latency of LLM.int8() can be higher than that of the FP16 baseline”
— LRQ: Optimizing Post-Training Quantization for Large Language Models by Learning Low-Rank Weight-Scaling Matrices
“However, both LLM.int8() and ZeroQuant are not efficient for quantizing LLMs to extreme low-precision number formats such as 3-bit integers.”
— AdpQ: A Zero-shot Calibration Free Adaptive Post Training Quantization Method for LLMs
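
For context on the latency critiques above, here is a minimal NumPy sketch of a mixed-precision 8-bit matrix multiplication in the spirit of LLM.int8(): hidden dimensions containing large-magnitude activations are multiplied in higher precision, and the rest go through an int8 path with per-row and per-column scales. The function name int8_matmul_sketch, the float32 stand-in for FP16, and the default threshold of 6.0 are assumptions made for illustration; this is not the reference implementation.

```python
import numpy as np

def int8_matmul_sketch(X, W, threshold=6.0):
    """Illustrative mixed-precision product of X (tokens x hidden) and W (hidden x out)."""
    # Hidden dimensions with any activation magnitude >= threshold are treated as outliers.
    outlier = (np.abs(X) >= threshold).any(axis=0)
    regular = ~outlier

    # Higher-precision path for the outlier dimensions (float32 stands in for FP16 here).
    out_hi = X[:, outlier] @ W[outlier, :]

    # Int8 path: one scale per row of X and one per column of W (absmax scaling).
    Xr, Wr = X[:, regular], W[regular, :]
    sx = np.max(np.abs(Xr), axis=1, keepdims=True, initial=0.0) / 127.0
    sw = np.max(np.abs(Wr), axis=0, keepdims=True, initial=0.0) / 127.0
    sx[sx == 0] = 1.0  # avoid division by zero for all-zero rows
    sw[sw == 0] = 1.0  # avoid division by zero for all-zero columns
    Xq = np.round(Xr / sx).astype(np.int8)
    Wq = np.round(Wr / sw).astype(np.int8)

    # Accumulate in int32, then rescale back to floating point.
    acc = Xq.astype(np.int32) @ Wq.astype(np.int32)
    out_lo = acc * sx * sw

    return out_hi + out_lo
```

The sketch performs outlier detection, a matrix split, two separate multiplications, and a dequantization step. Reading the reported latency overhead as a consequence of that extra bookkeeping is an interpretation, not something the quoted sentences state.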