Hey, wait a minute: on at-issue sensitivity in Language Models
This work addresses the challenge of scalable naturalness assessment for dialogue systems and offers a systematic approach to testing discourse-sensitive behavior, though its contribution is incremental in that it applies existing linguistic concepts to LM evaluation.
This study tackles the problem of evaluating dialogue naturalness in language models by introducing the Divide, Generate, Recombine, and Compare (DGRC) method. Using DGRC, it finds that LMs prefer to continue on at-issue content, that this preference is stronger in instruct-tuned models, and that it is modulated by cues like 'Hey, wait a minute'.
Evaluating the naturalness of dialogue in language models (LMs) is not trivial: notions of 'naturalness' vary, and scalable quantitative metrics remain limited. This study leverages the linguistic notion of 'at-issueness' to assess dialogue naturalness and introduces a new method: Divide, Generate, Recombine, and Compare (DGRC). DGRC (i) divides a dialogue, used as a prompt, into subparts, (ii) generates continuations for the subparts using LMs, (iii) recombines the dialogue and continuations, and (iv) compares the likelihoods of the recombined sequences. This approach mitigates bias in linguistic analyses of LMs and enables systematic testing of discourse-sensitive behavior. Applying DGRC, we find that LMs prefer to continue dialogue on at-issue content, with this effect enhanced in instruct-tuned models. LMs also reduce their at-issue preference when relevant cues (e.g., "Hey, wait a minute") are present. Although instruct-tuning does not further amplify this modulation, the pattern reflects a hallmark of successful dialogue dynamics.
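The four DGRC steps can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: the function names (`dgrc_compare`, `toy_generate`, `toy_log_likelihood`), the example dialogue, the canned continuations, and in particular the scorer, which is a smoothed unigram stand-in where a real setup would query an LM for generation and sequence likelihood. The numbers it produces say nothing about actual LM behavior; only the control flow is the point.

```python
import math
from collections import Counter
from typing import Callable, Dict

def dgrc_compare(
    full_dialogue: str,
    subparts: Dict[str, str],
    generate: Callable[[str], str],
    log_likelihood: Callable[[str, str], float],
) -> Dict[str, float]:
    """Hypothetical DGRC loop; `subparts` is the output of the Divide step."""
    # (ii) Generate: one continuation per subpart, each prompted in isolation.
    continuations = {label: generate(part) for label, part in subparts.items()}
    # (iii) Recombine + (iv) Compare: score every continuation against the
    # full, undivided dialogue so the scores share the same context.
    return {
        label: log_likelihood(full_dialogue, continuation)
        for label, continuation in continuations.items()
    }

# --- Toy stand-ins (assumptions; a real run would call an LM here) ---

# Illustrative dialogue: the appositive clause carries not-at-issue content.
DIALOGUE = "A: My neighbor, who is a doctor, moved to Paris."
SUBPARTS = {
    "at_issue": "A: My neighbor moved to Paris.",
    "not_at_issue": "A: My neighbor is a doctor.",
}
# Canned continuations standing in for LM generations.
CANNED = {
    "A: My neighbor moved to Paris.": "B: When did she move to Paris?",
    "A: My neighbor is a doctor.": "B: What kind of doctor is she?",
}

def toy_generate(prompt: str) -> str:
    return CANNED[prompt]

def toy_log_likelihood(context: str, continuation: str) -> float:
    # Add-one smoothed unigram score estimated from the context only.
    # This is NOT an LM likelihood; it just makes the sketch runnable.
    counts = Counter(context.lower().split())
    tokens = continuation.lower().split()
    vocab = set(counts) | set(tokens)
    total = sum(counts.values()) + len(vocab)
    return sum(math.log((counts[t] + 1) / total) for t in tokens)

scores = dgrc_compare(DIALOGUE, SUBPARTS, toy_generate, toy_log_likelihood)
preferred = max(scores, key=scores.get)
print(scores, preferred)
```

Passing `generate` and `log_likelihood` as callables keeps the comparison logic independent of any particular model, so the same loop could be reused with a base model and an instruct-tuned model, or with and without a cue such as "Hey, wait a minute" added to the dialogue.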