Superseded baseline · #29 of 80 most-superseded
LLM.int8()
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale · LLM quantization · first seen Aug 15, 2022
superseded — cited as a baseline and beaten by newer methods
3 papers critique it · 0 beat it on benchmarks
What papers say
Verbatim critique sentences, each from a paper that cites LLM.int8() as a baseline.
“However, this implementation results in significant latency overhead, sometimes even slower than FP16 inference.”
— Post Training Quantization of Large Language Models with Microscaling Formats
“the inference latency of LLM.int8() can be higher than that of the FP16 baseline”
— LRQ: Optimizing Post-Training Quantization for Large Language Models by Learning Low-Rank Weight-Scaling Matrices
“However, both LLM.int8() and ZeroQuant are not efficient for quantizing LLMs to extreme low-percision number formats such as 3-bit integers.”
— AdpQ: A Zero-shot Calibration Free Adaptive Post Training Quantization Method for LLMs