CL · Nov 3, 2024

Explaining and Improving Contrastive Decoding by Extrapolating the Probabilities of a Huge and Hypothetical LM

Haw-Shiuan Chang, Nanyun Peng, Mohit Bansal, Anil Ramakrishna, Tagyoung Chung

arXiv:2411.01610v115.226 citations · h-index: 29 · Has Code · EMNLP

Originality: Incremental advance

AI Analysis

This work gives NLP researchers and practitioners a more effective decoding method for open-ended text generation, though the advance is incremental because it builds on existing contrastive decoding techniques.

The paper aims to explain and improve contrastive decoding (CD) for language models. It first proves that CD can be viewed as a linear extrapolation of next-token logits toward a huge, hypothetical LM, and shows that this linearity can prevent CD from outputting obvious answers that the amateur LM already rates highly. It then proposes Asymptotic Probability Decoding (APD), which infers the probabilities of an infinitely large LM at no more inference cost than CD. APD achieves state-of-the-art factuality on FactualityPrompts for Pythia 6.9B and OPT 6.7B and is often significantly better than CD on five commonsense QA datasets. For example, APD on top of Pythia 6.9B reaches lower perplexity than Pythia 12B on CommonsenseQA and LAMBADA.

Contrastive decoding (CD) (Li et al., 2023) improves the next-token distribution of a large expert language model (LM) using a small amateur LM. Although CD is applied to various LMs and domains to enhance open-ended text generation, it is still unclear why CD often works well, when it could fail, and how we can make it better. To deepen our understanding of CD, we first theoretically prove that CD could be viewed as linearly extrapolating the next-token logits from a huge and hypothetical LM. We also highlight that the linear extrapolation could make CD unable to output the most obvious answers that have already been assigned high probabilities by the amateur LM. To overcome CD's limitation, we propose a new unsupervised decoding method called Asymptotic Probability Decoding (APD). APD explicitly extrapolates the probability curves from the LMs of different sizes to infer the asymptotic probabilities from an infinitely large LM without inducing more inference costs than CD. In FactualityPrompts, an open-ended text generation benchmark, sampling using APD significantly boosts factuality in comparison to the CD sampling and its variants, and achieves state-of-the-art results for Pythia 6.9B and OPT 6.7B. Furthermore, in five commonsense QA datasets, APD is often significantly better than CD and achieves a similar effect of using a larger LLM. For example, the perplexity of APD on top of Pythia 6.9B is even lower than the perplexity of Pythia 12B in CommonsenseQA and LAMBADA.
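The linear-extrapolation view of CD, and the "obvious answer" limitation the abstract mentions, can be illustrated with a toy example. This is a minimal sketch, not the authors' implementation: it assumes one common logit-space form of CD, `expert + beta * (expert - amateur)`, and the function names, the `beta` weight, the three-token vocabulary, and all logit values are hypothetical. The paper's exact parameterization may differ.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_logits(expert_logits, amateur_logits, beta=1.0):
    """Assumed CD form: step from the amateur past the expert in logit space.

    For each token, the expert logit is pushed further along the
    amateur-to-expert direction, i.e. a linear extrapolation.
    """
    return [e + beta * (e - a) for e, a in zip(expert_logits, amateur_logits)]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


# Hypothetical toy vocabulary and logits (illustrative values only).
vocab = ["obvious", "plausible", "rare"]
amateur = [3.0, 0.0, -1.0]  # the small LM already favors "obvious"
expert = [3.0, 2.0, -1.0]   # the large LM still ranks "obvious" first

cd = contrastive_logits(expert, amateur, beta=1.0)  # [3.0, 4.0, -1.0]

print("expert top token:", vocab[argmax(expert)])
print("CD top token:    ", vocab[argmax(cd)])
print("P(obvious) expert:", round(softmax(expert)[0], 3))
print("P(obvious) CD:    ", round(softmax(cd)[0], 3))
```

In this toy case the expert and amateur agree on the logit of "obvious", so the extrapolation leaves it unchanged while boosting "plausible", and the top token under CD switches away from the answer both models prefer. This is the kind of failure that motivates extrapolating probability curves, as APD does, rather than logits along a straight line.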
