CL, Jan 15

DialDefer: A Framework for Detecting and Mitigating LLM Dialogic Deference

Parisa Rabbani, Priyam Sahoo, Ruben Mathew, Aishee Mondal, Harshita Ketharaman, Nimet Beyza Bozdag, Dilek Hakkani-Tür

arXiv:2601.10896v1, 1 citation, h-index: 10

Originality: Incremental advance

AI Analysis

This work addresses the reliability of LLMs used as third-party judges in dialogue contexts. Its contribution is incremental: it reframes framing-induced judgment shifts as a calibration problem that accuracy optimization alone does not capture.

The study finds that LLMs judge identical claims differently depending on framing (statement verification versus speaker attribution), with shifts of up to 87 percentage points across domains while accuracy stays within 2 percentage points. The authors introduce DialDefer to detect and mitigate these framing-induced judgment shifts, and find that mitigation can reduce deference but risks over-correcting into skepticism.

LLMs are increasingly used as third-party judges, yet their reliability when evaluating speakers in dialogue remains poorly understood. We show that LLMs judge identical claims differently depending on framing: the same content elicits different verdicts when presented as a statement to verify ("Is this statement correct?") versus attributed to a speaker ("Is this speaker correct?"). We call this dialogic deference and introduce DialDefer, a framework for detecting and mitigating these framing-induced judgment shifts. Our Dialogic Deference Score (DDS) captures directional shifts that aggregate accuracy obscures. Across nine domains, 3k+ instances, and four models, conversational framing induces large shifts (|DDS| up to 87pp, p < .0001) while accuracy remains stable (<2pp), with effects amplifying 2-4x on naturalistic Reddit conversations. Models can shift toward agreement (deference) or disagreement (skepticism) depending on domain -- the same model ranges from DDS = -53 on graduate-level science to +58 on social judgment. Ablations reveal that human-vs-LLM attribution drives the largest shifts (17.7pp swing), suggesting models treat disagreement with humans as more costly than with AI. Mitigation attempts reduce deference but can over-correct into skepticism, framing this as a calibration problem beyond accuracy optimization.
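To make concrete how a directional shift can be large while accuracy stays flat, here is a minimal Python sketch. It is not the paper's DDS definition, which the abstract does not spell out. As an assumption, it measures the shift as the change in agreement rate (the share of "correct" verdicts) between the speaker framing and the statement framing, in percentage points. The `Judgment` record, its field names, and the toy data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgment:
    """One claim judged under both framings (hypothetical record layout)."""
    claim_is_correct: bool    # ground-truth label of the claim
    statement_verdict: bool   # model answered "correct" to "Is this statement correct?"
    speaker_verdict: bool     # model answered "correct" to "Is this speaker correct?"


def agreement_rate(verdicts):
    """Percentage of verdicts in which the model endorsed the claim."""
    return 100.0 * sum(verdicts) / len(verdicts)


def accuracy(judgments, framing):
    """Percentage of verdicts matching ground truth under the given framing field."""
    hits = [getattr(j, framing) == j.claim_is_correct for j in judgments]
    return 100.0 * sum(hits) / len(hits)


def directional_shift(judgments):
    """Assumed stand-in for a deference score: speaker-framing agreement
    minus statement-framing agreement, in percentage points.
    Positive means a shift toward agreement, negative toward disagreement."""
    statement = agreement_rate([j.statement_verdict for j in judgments])
    speaker = agreement_rate([j.speaker_verdict for j in judgments])
    return speaker - statement


# Toy data (fabricated for illustration only): two correct and two incorrect claims.
EXAMPLE = [
    Judgment(claim_is_correct=True, statement_verdict=True, speaker_verdict=True),
    Judgment(claim_is_correct=True, statement_verdict=False, speaker_verdict=True),
    Judgment(claim_is_correct=False, statement_verdict=False, speaker_verdict=True),
    Judgment(claim_is_correct=False, statement_verdict=False, speaker_verdict=False),
]

if __name__ == "__main__":
    print("statement accuracy:", accuracy(EXAMPLE, "statement_verdict"))
    print("speaker accuracy:  ", accuracy(EXAMPLE, "speaker_verdict"))
    print("directional shift: ", directional_shift(EXAMPLE))
```

In this toy data the model is right on three of four claims under both framings, so accuracy does not move, yet it endorses two more claims under the speaker framing. The errors change direction rather than number, which is the kind of effect an aggregate accuracy comparison would hide.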

