Perplexity Cannot Always Tell Right from Wrong
This work addresses a critical limitation in how researchers and practitioners evaluate language models: perplexity can mislead model selection. The contribution is incremental, in that it supplements prior empirical critiques of perplexity with formal proofs.
The paper proves that perplexity can be an unsuitable metric for model selection in compact decoder-only Transformer models: if such a model predicts any sequence accurately and confidently, there must be another sequence to which it assigns very low perplexity yet predicts incorrectly, and perplexity favours a more confident model only when its gain in confidence is matched by a commensurate gain in accuracy.
Perplexity -- a function measuring a model's overall level of "surprise" when encountering a particular output -- has gained significant traction in recent years, both as a loss function and as a simple-to-compute metric of model quality. Prior studies have pointed out several limitations of perplexity, mostly on empirical grounds. Here we leverage recent results on Transformer continuity to show rigorously how perplexity may be an unsuitable metric for model selection. Specifically, we prove that if there is any sequence that a compact decoder-only Transformer model predicts accurately and confidently -- a necessary prerequisite for strong generalisation -- then there must exist another sequence that has very low perplexity but is not predicted correctly by that same model. Further, by analytically studying iso-perplexity plots, we find that perplexity will not always select the more accurate model: any increase in model confidence must be accompanied by a commensurate rise in accuracy for the new model to be selected.
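
The gap between perplexity and accuracy can be illustrated with a toy calculation. The sketch below is not the paper's construction: it assumes a hypothetical two-token vocabulary (so a prediction counts as correct exactly when the true token receives probability above 0.5), and the per-token probabilities for the two models are invented for illustration. Perplexity is taken in its standard form, the exponential of the mean negative log-probability assigned to the true tokens.

```python
import math


def perplexity(true_token_probs):
    """Exponential of the mean negative log-probability of the true tokens."""
    mean_nll = -sum(math.log(p) for p in true_token_probs) / len(true_token_probs)
    return math.exp(mean_nll)


def accuracy(true_token_probs):
    """Fraction of tokens predicted correctly.

    Assumption: binary vocabulary, so the argmax is the true token
    exactly when its probability exceeds 0.5.
    """
    return sum(p > 0.5 for p in true_token_probs) / len(true_token_probs)


# Hypothetical model A: right on 9 of 10 tokens, but never confident.
model_a = [0.55] * 9 + [0.45]
# Hypothetical model B: right on only 8 of 10 tokens, but very confident when right.
model_b = [0.99] * 8 + [0.40] * 2

for name, probs in (("A", model_a), ("B", model_b)):
    print(f"model {name}: accuracy={accuracy(probs):.2f}, "
          f"perplexity={perplexity(probs):.3f}")
```

Under these made-up numbers, model B has the lower perplexity although model A is the more accurate one, so selecting by perplexity alone would pick the less accurate model.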