CL, LG · Nov 2, 2023

Divergent Token Metrics: Measuring degradation to prune away LLM components -- and optimize quantization

arXiv:2311.01544v3 · 30 citations · h-index: 25
Originality: Incremental advance
AI Analysis

This paper addresses the need for efficient LLM compression for deployment. It is an incremental advance: it builds on existing compression methods and contributes a new evaluation metric.

The study tackles the problem of accurately assessing compressed large language models by introducing Divergent Token Metrics. Using these metrics on the Llama-2 model family, it finds that 25% of attention components can be pruned beyond 90% while keeping SOTA performance, and that more than 80% of parameters can be naively quantized to int8 without special outlier management.

Large Language Models (LLMs) have reshaped natural language processing with their impressive capabilities. However, their ever-increasing size has raised concerns about their effective deployment and the need for LLM compression. This study introduces the Divergent Token Metrics (DTMs), a novel approach to assessing compressed LLMs, addressing the limitations of traditional perplexity or accuracy measures that fail to accurately reflect text generation quality. DTMs measure token divergences that allow deeper insights into the subtleties of model compression, in particular, when evaluating components' impacts individually. Utilizing the First Divergent Token Metric (FDTM) in model sparsification reveals that 25% of all attention components can be pruned beyond 90% on the Llama-2 model family, still keeping SOTA performance. For quantization, FDTM suggests that more than 80% of parameters can be naively transformed to int8 without special outlier management. These evaluations indicate the necessity of choosing appropriate compressions for parameters individually -- and that FDTM can identify those -- while standard metrics result in deteriorated outcomes.
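The abstract does not spell out how a divergent token is computed, so the following is only a minimal sketch. It assumes the First Divergent Token is the index of the first position at which the compressed model's greedily decoded continuation differs from the original model's, given the same prompt. The function names `first_divergent_token` and `mean_first_divergent_token` are hypothetical and not taken from the paper; token sequences are plain lists of token IDs.

```python
from typing import Iterable, Sequence, Tuple


def first_divergent_token(reference: Sequence[int], candidate: Sequence[int]) -> int:
    """Index of the first position where the two token sequences differ.

    Assumption: if no differing position exists within the shared length,
    the length of the shorter sequence is returned (no divergence observed).
    """
    for index, (ref_tok, cand_tok) in enumerate(zip(reference, candidate)):
        if ref_tok != cand_tok:
            return index
    return min(len(reference), len(candidate))


def mean_first_divergent_token(pairs: Iterable[Tuple[Sequence[int], Sequence[int]]]) -> float:
    """Average first-divergence index over (reference, candidate) pairs.

    A higher value means the compressed model follows the original for longer.
    """
    indices = [first_divergent_token(ref, cand) for ref, cand in pairs]
    if not indices:
        raise ValueError("at least one sequence pair is required")
    return sum(indices) / len(indices)


if __name__ == "__main__":
    original = [12, 7, 99, 4, 31]    # tokens generated by the uncompressed model
    compressed = [12, 7, 99, 8, 31]  # tokens generated by the compressed model
    print(first_divergent_token(original, compressed))
```

Under this reading, a component whose compression barely shifts the first-divergence index is a better candidate for aggressive pruning or int8 conversion than one whose compression makes the generations diverge early.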

