CL, AI, Oct 20, 2023

Self-Consistency of Large Language Models under Ambiguity

arXiv:2310.13439v1, 139 citations, h-index: 6
Originality: Incremental advance
AI Analysis

This work addresses inconsistent answers from LLMs in tasks such as question answering. The advance is incremental because it builds on existing evaluation methods.

The paper studies self-consistency in large language models under ambiguity. On an ambiguous integer sequence completion task, models are 67% to 82% consistent on average, and consistency increases with model capability. The models are nonetheless uncalibrated when judging their own consistency.

Large language models (LLMs) that do not give consistent answers across contexts are problematic when used for tasks with expectations of consistency, e.g., question-answering, explanations, etc. Our work presents an evaluation benchmark for self-consistency in cases of under-specification where two or more answers can be correct. We conduct a series of behavioral experiments on the OpenAI model suite using an ambiguous integer sequence completion task. We find that average consistency ranges from 67% to 82%, far higher than would be predicted if a model's consistency were random, and increases as model capability improves. Furthermore, we show that models tend to maintain self-consistency across a series of robustness checks, including prompting speaker changes and sequence length changes. These results suggest that self-consistency arises as an emergent capability without specifically training for it. Despite this, we find that models are uncalibrated when judging their own consistency, with models displaying both over- and under-confidence. We also propose a nonparametric test for determining from token output distribution whether a model assigns non-trivial probability to alternative answers. Using this test, we find that despite increases in self-consistency, models usually place significant weight on alternative, inconsistent answers. This distribution of probability mass provides evidence that even highly self-consistent models internally compute multiple possible responses.
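To make the notion of consistency "far higher than random" concrete, here is a minimal sketch of one way to score it. It is not the paper's metric, which the abstract does not specify: the pairwise-agreement definition, the function names, the uniform-chance baseline, and the example sequence and responses are all illustrative assumptions.

```python
from itertools import combinations


def pairwise_consistency(answers):
    """Fraction of answer pairs that agree (assumed metric, not the paper's)."""
    pairs = list(combinations(answers, 2))
    if not pairs:
        return 1.0  # a single answer is trivially consistent with itself
    return sum(a == b for a, b in pairs) / len(pairs)


def chance_consistency(num_valid_answers):
    """Expected pairwise agreement if each answer were drawn uniformly
    at random from the valid continuations (assumed baseline)."""
    return 1.0 / num_valid_answers


# Hypothetical example: the sequence 1, 2, 4 is ambiguous because it can
# continue with 8 (doubling) or 7 (differences grow by one).
# The responses stand in for the same model queried in five contexts.
responses = [8, 8, 8, 7, 8]

score = pairwise_consistency(responses)
baseline = chance_consistency(num_valid_answers=2)
print(f"consistency={score:.2f}, chance baseline={baseline:.2f}")
```

Under these assumptions, a score well above the baseline would indicate that the model favors one continuation across contexts rather than choosing among valid answers at random.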

Code Implementations: 1 repo