LIEDER: Linguistically-Informed Evaluation for Discourse Entity Recognition
This work addresses the need for linguistically informed evaluation in NLP by benchmarking language models' semantic knowledge. It is incremental in that it builds on prior assessments of discourse entity recognition.
The authors ask which fundamental semantic properties of discourse entities large language models understand, and create the LIEDER dataset to evaluate knowledge of four of them: existence, uniqueness, plurality, and novelty. They find that state-of-the-art models are sensitive to all of these properties except novelty, indicating a gap relative to human-level language understanding.
Discourse Entity (DE) recognition is the task of identifying novel and known entities introduced within a text. While previous work has found that large language models have basic, if imperfect, DE recognition abilities (Schuster and Linzen, 2022), it remains largely unassessed which of the fundamental semantic properties governing the introduction of DEs, and subsequent reference to them, these models have knowledge of. We propose the Linguistically-Informed Evaluation for Discourse Entity Recognition (LIEDER) dataset, which allows for a detailed examination of language models' knowledge of four crucial semantic properties: existence, uniqueness, plurality, and novelty. We find evidence that state-of-the-art large language models exhibit sensitivity to all of these properties except novelty, which demonstrates that they have yet to reach human-level language understanding abilities.