AI · Oct 21, 2024

Language Model Probabilities are Not Calibrated in Numeric Contexts

Charles Lovering, Michael Krumdick, Viet Dac Lai, Seth Ebner, Nilesh Kumar, Varshini Reddy, Rik Koncel-Kedziorski, Chris Tanner

arXiv:2410.16007v2 · 14.712 citations · h-index: 9

Originality: Incremental advance

AI Analysis

The paper's findings highlight a fundamental weakness of language models on tasks that require probabilistic reasoning, a capability that applications such as decision-making and uncertainty quantification depend on.

The paper tests whether language model output probabilities are calibrated to the numeric likelihoods stated in their prompts. It finds that even strong models such as GPT-4o-mini and Llama-3.1-8B are poorly calibrated and show systematic biases, for example choosing an option based on its position in the prompt rather than its implied probability.

Some statements have one well-defined continuation (e.g., "the Eiffel Tower is in [Paris]"), whereas others have a natural distribution over multiple options (e.g., "the weighted coin flip was [Heads/Tails]"). We argue that language model (LM) outputs should capture these natural distributions. Our work specifically tests whether LM output probabilities are calibrated to numeric information within their textual contexts. For example, if the context (the prompt) concerns two equally likely options (e.g., heads or tails for a fair coin), the LM output probabilities should also be equal. Likewise, in a context with nonuniformly likely events (e.g., rolling a pair with two dice), an LM should output proportionate probabilities. However, we find that even in simple settings, the best LMs (1) are poorly calibrated and (2) have systematic biases: artifacts like word identity, word order, and word frequency all impact calibration. For example, gpt-4o-mini often picks the first of two options presented in the prompt regardless of the options' implied likelihoods, whereas Llama-3.1-8B picks the second. Models do not allocate probability mass among valid options in a calibrated manner.
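To make the notion of calibration in the abstract concrete, the following is a minimal sketch, not the paper's evaluation code or metric. It computes the exact chance of rolling a pair with two fair dice, then measures how far a set of model probabilities sits from that target using total variation distance, which is an assumed choice of distance for illustration. The function names and the model probabilities are hypothetical and invented for this example.

```python
from fractions import Fraction
from itertools import product


def pair_probability(sides: int = 6) -> Fraction:
    """Exact probability that two fair dice show the same face."""
    outcomes = list(product(range(1, sides + 1), repeat=2))
    pairs = sum(1 for a, b in outcomes if a == b)
    return Fraction(pairs, len(outcomes))


def renormalize(token_probs: dict[str, float]) -> dict[str, float]:
    """Rescale the probability mass on the valid options so it sums to 1."""
    total = sum(token_probs.values())
    if total <= 0:
        raise ValueError("no probability mass on the valid options")
    return {option: p / total for option, p in token_probs.items()}


def total_variation(expected: dict[str, float], observed: dict[str, float]) -> float:
    """Half the L1 distance between two distributions over the same options."""
    return 0.5 * sum(abs(expected[k] - observed.get(k, 0.0)) for k in expected)


# Target distribution implied by the prompt: a pair has probability 6/36.
p_pair = float(pair_probability())
expected = {"pair": p_pair, "no pair": 1.0 - p_pair}

# Hypothetical model probabilities (made up, not results from the paper).
# The two valid options receive equal mass, so the model ignores the odds.
hypothetical_model_probs = {"pair": 0.30, "no pair": 0.30}
observed = renormalize(hypothetical_model_probs)

gap = total_variation(expected, observed)
print(f"expected={expected}, observed={observed}, gap={gap:.3f}")
```

A gap of 0 would mean the model's probabilities over the valid options match the distribution implied by the prompt, while larger values indicate worse calibration. Renormalizing over the valid options is a design choice of this sketch: it separates how mass is split among valid answers from how much mass leaks to other tokens.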
