LG · Jul 12, 2024

Accuracy is Not All You Need

arXiv:2407.09141v1 · 14 citations · h-index: 47
Originality: Incremental advance
AI Analysis

This work is relevant to researchers and practitioners in model compression: it shows that current accuracy-based evaluation methods are insufficient. Its contribution is incremental, since it proposes new evaluation metrics rather than a fundamental breakthrough.

The paper tackles the problem of evaluating compressed large language models beyond just accuracy, showing that even with similar accuracy, compressed models exhibit significant answer flips and perform worse in free-form generative tasks like MT-Bench, with qualitative and quantitative evidence of degradation.

When Large Language Models (LLMs) are compressed using techniques such as quantization, the predominant way to demonstrate the validity of such techniques is by measuring the model's accuracy on various benchmarks. If the accuracies of the baseline model and the compressed model are close, it is assumed that there was negligible degradation in quality. However, even when the accuracies of the baseline and compressed models are similar, we observe the phenomenon of flips, wherein answers change from correct to incorrect and vice versa in proportion. We conduct a detailed study of metrics across multiple compression techniques, models and datasets, demonstrating that the behavior of compressed models as visible to end-users is often significantly different from the baseline model, even when accuracy is similar. We further evaluate compressed models qualitatively and quantitatively using MT-Bench and show that compressed models are significantly worse than baseline models in this free-form generative task. Thus, we argue that compression techniques should also be evaluated using distance metrics. We propose two such metrics, KL-Divergence and flips, and show that they are well correlated.
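
To make the two proposed distance metrics concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: it assumes flips are counted as the fraction of benchmark questions whose correctness changes between the baseline and the compressed model, and that KL-Divergence is computed between the two models' probability distributions over answer choices. The function names and the toy data are hypothetical.

```python
import math


def accuracy(labels, preds):
    """Fraction of predictions that match the gold labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def flips(labels, base_preds, comp_preds):
    """Fraction of examples whose correctness differs between the two models.

    Assumption: a flip is either correct -> incorrect or incorrect -> correct.
    """
    correct_to_incorrect = 0
    incorrect_to_correct = 0
    for y, b, c in zip(labels, base_preds, comp_preds):
        base_ok, comp_ok = (b == y), (c == y)
        if base_ok and not comp_ok:
            correct_to_incorrect += 1
        elif comp_ok and not base_ok:
            incorrect_to_correct += 1
    n = len(labels)
    return {
        "correct_to_incorrect": correct_to_incorrect / n,
        "incorrect_to_correct": incorrect_to_correct / n,
        "flips": (correct_to_incorrect + incorrect_to_correct) / n,
    }


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same answer choices.

    Terms with p_i == 0 contribute nothing; q_i is floored at eps
    (an illustrative choice) to avoid division by zero.
    """
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


# Toy data (hypothetical): both models score 3 out of 4, yet half the answers flip.
labels = ["A", "B", "C", "D"]
base_preds = ["A", "B", "C", "A"]
comp_preds = ["A", "B", "D", "D"]

base_acc = accuracy(labels, base_preds)
comp_acc = accuracy(labels, comp_preds)
flip_stats = flips(labels, base_preds, comp_preds)
```

In this toy case the two accuracies are identical, because one correct-to-incorrect change is offset by one incorrect-to-correct change. That cancellation is why an accuracy comparison alone can hide differences in behavior that a flips count exposes.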

