CL | HC | Mar 18, 2025

Navigating Rifts in Human-LLM Grounding: Study and Benchmark

Microsoft
arXiv:2503.13975v2 | 24 citations | h-index: 15 | ACL
Originality: Incremental advance
AI Analysis

This paper addresses LLMs' difficulty with conversational grounding, which can frustrate users and cause serious consequences in high-stakes scenarios.

The paper studies grounding challenges in human-LLM conversations by analyzing three datasets. It finds that LLMs were three times less likely than humans to initiate clarification and sixteen times less likely to provide follow-up requests, and it introduces the Rifts benchmark to expose these failures.

Language models excel at following instructions but often struggle with the collaborative aspects of conversation that humans naturally employ. This limitation in grounding -- the process by which conversation participants establish mutual understanding -- can lead to outcomes ranging from frustrated users to serious consequences in high-stakes scenarios. To systematically study grounding challenges in human-LLM interactions, we analyze logs from three human-assistant datasets: WildChat, MultiWOZ, and Bing Chat. We develop a taxonomy of grounding acts and build models to annotate and forecast grounding behavior. Our findings reveal significant differences in human-human and human-LLM grounding: LLMs were three times less likely to initiate clarification and sixteen times less likely to provide follow-up requests than humans. Additionally, we find that early grounding failures predict later interaction breakdowns. Building on these insights, we introduce Rifts, a benchmark derived from publicly available LLM interaction data containing situations where LLMs fail to initiate grounding. We note that current frontier models perform poorly on Rifts, highlighting the need to reconsider how we train and prompt LLMs for human interaction. To this end, we develop a preliminary intervention aimed at mitigating grounding failures.
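To make the "N times less likely" comparison concrete, here is a minimal Python sketch of how such a ratio could be computed from turns annotated with grounding acts. The label names, the dictionary layout, and the helpers `act_rate` and `rate_ratio` are hypothetical illustrations and are not taken from the paper. The toy counts are invented for the example and do not reproduce the paper's results.

```python
from collections import Counter


def act_rate(turns, speaker, act):
    """Fraction of `speaker`'s turns that carry the grounding-act label `act`."""
    labels = [t["act"] for t in turns if t["speaker"] == speaker]
    if not labels:
        return 0.0
    return Counter(labels)[act] / len(labels)


def rate_ratio(turns, act, baseline="human", other="llm"):
    """How many times more often `baseline` uses `act` than `other` does."""
    base = act_rate(turns, baseline, act)
    comp = act_rate(turns, other, act)
    if comp == 0:
        return float("inf")  # `other` never produced this act
    return base / comp


# Toy annotated turns (hypothetical labels and counts, for illustration only).
toy_turns = (
    [{"speaker": "human", "act": "clarification"}] * 4
    + [{"speaker": "human", "act": "other"}] * 4
    + [{"speaker": "llm", "act": "clarification"}] * 1
    + [{"speaker": "llm", "act": "other"}] * 7
)

print(rate_ratio(toy_turns, "clarification"))  # 4.0 on this toy data
```

In this toy setup, humans ask for clarification in 4 of 8 turns and the LLM in 1 of 8, so the human rate is four times the LLM rate. A real analysis would replace the toy list with annotations produced under the paper's taxonomy.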

