CL AIOct 20, 2023

Self-Consistency of Large Language Models under Ambiguity

Henning Bartsch, Ole Jorgensen, Domenic Rosati, Jason Hoelscher-Obermaier, Jacob Pfau

arXiv:2310.13439v122.0139 citationsh-index: 6Has Code

Originality Incremental advance

AI Analysis

This addresses the issue of inconsistent answers in LLMs for tasks like question-answering, which is incremental as it builds on existing evaluation methods.

The paper tackles the problem of self-consistency in large language models under ambiguity, finding that models achieve 67% to 82% consistency in an ambiguous integer sequence task, which increases with model capability, but are uncalibrated in judging their own consistency.

Large language models (LLMs) that do not give consistent answers across contexts are problematic when used for tasks with expectations of consistency, e.g., question-answering, explanations, etc. Our work presents an evaluation benchmark for self-consistency in cases of under-specification where two or more answers can be correct. We conduct a series of behavioral experiments on the OpenAI model suite using an ambiguous integer sequence completion task. We find that average consistency ranges from 67\% to 82\%, far higher than would be predicted if a model's consistency was random, and increases as model capability improves. Furthermore, we show that models tend to maintain self-consistency across a series of robustness checks, including prompting speaker changes and sequence length changes. These results suggest that self-consistency arises as an emergent capability without specifically training for it. Despite this, we find that models are uncalibrated when judging their own consistency, with models displaying both over- and under-confidence. We also propose a nonparametric test for determining from token output distribution whether a model assigns non-trivial probability to alternative answers. Using this test, we find that despite increases in self-consistency, models usually place significant weight on alternative, inconsistent answers. This distribution of probability mass provides evidence that even highly self-consistent models internally compute multiple possible responses.

View on arXiv PDF Code

Similar