CL · Apr 16, 2021

Surface Form Competition: Why the Highest Probability Answer Isn't Always Right

arXiv:2104.08315v9 · 716 citations
Originality: Incremental advance
AI Analysis

This paper addresses a specific issue in zero-shot evaluation that matters to researchers and practitioners using large language models, though it is an incremental improvement over existing scoring methods.

The paper tackles surface form competition in zero-shot multiple choice tasks with large language models: different strings that represent the same concept split the available probability mass, which lowers the probability of the correct answer. It introduces Domain Conditional Pointwise Mutual Information, an alternative scoring function that achieves consistent performance gains across GPT-2 and GPT-3 models on a variety of datasets.

Large language models have shown promising results in zero-shot settings (Brown et al., 2020; Radford et al., 2019). For example, they can perform multiple choice tasks simply by conditioning on a question and selecting the answer with the highest probability. However, ranking by string probability can be problematic due to surface form competition, wherein different surface forms compete for probability mass, even if they represent the same underlying concept, e.g. "computer" and "PC." Since probability mass is finite, this lowers the probability of the correct answer, due to competition from other strings that are valid answers (but not one of the multiple choice options). We introduce Domain Conditional Pointwise Mutual Information, an alternative scoring function that directly compensates for surface form competition by simply reweighing each option according to a term that is proportional to its a priori likelihood within the context of the specific zero-shot task. It achieves consistent gains in zero-shot performance over both calibrated (Zhao et al., 2021) and uncalibrated scoring functions on all GPT-2 and GPT-3 models over a variety of multiple choice datasets.
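
The scoring idea in the abstract can be illustrated with a minimal sketch. It assumes the score for each option is the log ratio of its probability given the question to its probability given only a short domain premise, which is one reading of "reweighing each option according to ... its a priori likelihood." The function names and all probabilities below are hypothetical; they are not taken from the paper or its code repositories.

```python
import math

def dcpmi_scores(logp_given_question, logp_given_domain):
    """Score each option as log P(option | question) - log P(option | domain premise).

    Both arguments map option strings to log-probabilities. In practice these
    would come from a language model; here they are supplied by hand.
    """
    return {
        option: logp_given_question[option] - logp_given_domain[option]
        for option in logp_given_question
    }

def best_option(scores):
    """Return the option with the highest score."""
    return max(scores, key=scores.get)

# Hypothetical numbers: "computer" is the correct answer but shares probability
# mass with synonyms such as "PC", so its raw probability is lower than that of
# the distractor "keyboard".
logp_question = {"computer": math.log(0.10), "keyboard": math.log(0.12)}

# Hypothetical a priori likelihoods of each option given only a domain premise.
logp_domain = {"computer": math.log(0.02), "keyboard": math.log(0.06)}

raw_choice = best_option(logp_question)
dcpmi_choice = best_option(dcpmi_scores(logp_question, logp_domain))

print("raw probability picks:", raw_choice)
print("domain-conditional PMI picks:", dcpmi_choice)
```

With these made-up numbers, ranking by raw probability selects the distractor, while dividing by the domain-conditional prior favors the option whose probability rises most once the question is seen.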

Code Implementations: 2 repos