AIET, Jan 29

Mind the Gap: How Elicitation Protocols Shape the Stated-Revealed Preference Gap in Language Models

arXiv:2601.21975v1
Originality: Incremental advance
AI Analysis

The paper addresses the problem of accurately measuring language model preferences for AI safety and alignment research. It is rated incremental because it builds on existing work on the stated-revealed preference gap.

The study investigates how elicitation protocols affect the stated-revealed preference gap across 24 language models. Allowing neutrality and abstention in stated preferences improves Spearman's rank correlation (ρ) with forced-choice revealed preferences, but also allowing abstention in revealed preferences drives ρ to near-zero or negative values.

Recent work identifies a stated-revealed (SvR) preference gap in language models (LMs): a mismatch between the values models endorse and the choices they make in context. Existing evaluations rely heavily on binary forced-choice prompting, which entangles genuine preferences with artifacts of the elicitation protocol. We systematically study how elicitation protocols affect SvR correlation across 24 LMs. Allowing neutrality and abstention during stated preference elicitation allows us to exclude weak signals, substantially improving Spearman's rank correlation ($ρ$) between volunteered stated preferences and forced-choice revealed preferences. However, further allowing abstention in revealed preferences drives $ρ$ to near-zero or negative values due to high neutrality rates. Finally, we find that system prompt steering using stated preferences during revealed preference elicitation does not reliably improve SvR correlation on AIRiskDilemmas. Together, our results show that SvR correlation is highly protocol-dependent and that preference elicitation requires methods that account for indeterminate preferences.
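As a rough illustration of the measurement the abstract describes, the sketch below computes Spearman's rank correlation between stated and revealed preference scores after dropping items where the stated preference was neutral or abstained. It is not the authors' code: the function names, the use of `None` to mark neutrality or abstention, and the example value labels and scores are all hypothetical assumptions made for this example.

```python
from math import sqrt
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when one side has no rank variation
    return cov / sqrt(vx * vy)


def svr_correlation(stated, revealed):
    """Correlate stated and revealed scores over items with a volunteered signal.

    Assumption: both arguments map a value label to a numeric preference
    score, with None marking a neutral or abstained response.
    """
    kept = [
        v for v in stated
        if stated[v] is not None and revealed.get(v) is not None
    ]
    if len(kept) < 2:
        return float("nan")
    return spearman_rho([stated[v] for v in kept], [revealed[v] for v in kept])


# Hypothetical scores, invented for illustration only.
stated = {"honesty": 0.9, "privacy": None, "care": 0.7, "freedom": 0.2}
revealed = {"honesty": 0.8, "privacy": 0.5, "care": 0.6, "freedom": 0.1}
rho = svr_correlation(stated, revealed)
```

In this toy data, "privacy" is excluded because the stated response is marked neutral, and the remaining three items are ranked identically on both sides. The sketch also hints at why high neutrality rates are a problem: the more items are dropped, the fewer pairs remain, and with fewer than two pairs or no rank variation the correlation is undefined.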

