Meaning Beyond Truth Conditions: Evaluating Discourse Level Understanding via Anaphora Accessibility
This work addresses the need for better discourse-level evaluation in NLP. Its contribution is incremental: it builds on existing theoretical research to propose a new diagnostic task.
The authors address the problem of evaluating discourse-level understanding in natural language processing by proposing anaphora accessibility as a diagnostic task. They find that LLMs and humans perform similarly on some tasks but diverge on others, and they attribute the divergence to LLMs' reliance on specific lexical items, in contrast to human sensitivity to structural abstractions.
We present a hierarchy of natural language understanding abilities and argue for the importance of moving beyond assessments of understanding at the lexical and sentence levels to the discourse level. We propose the task of anaphora accessibility as a diagnostic for assessing discourse understanding, and to this end, present an evaluation dataset inspired by theoretical research in dynamic semantics. We evaluate human and LLM performance on our dataset and find that LLMs and humans align on some tasks and diverge on others. Such divergence can be explained by LLMs' reliance on specific lexical items during language comprehension, in contrast to human sensitivity to structural abstractions.