Un-considering Contextual Information: Assessing LLMs' Understanding of Indexical Elements
This work addresses the problem of assessing LLMs' understanding of linguistic indexicals, providing NLP researchers with a new benchmark dataset. The contribution is incremental in that it extends existing coreference resolution evaluations to a specific linguistic category.
This study evaluated LLM performance on coreference resolution with indexicals like I, you, here, and tomorrow, revealing that LLMs perform well with some indexicals (e.g., I) but struggle with others (e.g., you, here, tomorrow), and that syntactic cues have mixed effects on performance.
Large Language Models (LLMs) have demonstrated impressive performance in tasks related to coreference resolution. However, previous studies mostly assessed LLM performance on coreference resolution with nouns and third-person pronouns. This study evaluates LLM performance on coreference resolution with indexicals like I, you, here, and tomorrow, which pose unique challenges due to their linguistic properties. We present the first study examining how LLMs interpret indexicals in English, releasing the English Indexical Dataset with 1600 multiple-choice questions. We evaluate leading LLMs, including GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, and DeepSeek V3. Our results reveal that LLMs perform impressively with some indexicals (I) while struggling with others (you, here, tomorrow), and that syntactic cues (e.g., quotation) improve LLM performance with some indexicals while reducing it with others. Code and data are available at: https://github.com/metehanoguzz/LLMs-Indexicals-English.
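As an illustration of how per-indexical results on a multiple-choice benchmark might be tallied, the following is a minimal sketch and not the authors' evaluation code. The field names `indexical` and `answer`, the function `accuracy_by_indexical`, and the toy items are all hypothetical; the released dataset may use a different schema.

```python
from collections import defaultdict


def accuracy_by_indexical(items, predictions):
    """Return accuracy per indexical type.

    items: list of dicts with hypothetical keys 'indexical' and 'answer'.
    predictions: list of option labels chosen by a model, aligned with items.
    """
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")

    correct = defaultdict(int)
    total = defaultdict(int)
    for item, pred in zip(items, predictions):
        key = item["indexical"]
        total[key] += 1
        if pred == item["answer"]:
            correct[key] += 1
    # Every key in total has at least one item, so division is safe.
    return {key: correct[key] / total[key] for key in total}


# Toy, made-up items for illustration only (not taken from the dataset).
items = [
    {"indexical": "I", "answer": "A"},
    {"indexical": "I", "answer": "B"},
    {"indexical": "here", "answer": "A"},
    {"indexical": "here", "answer": "C"},
]
predictions = ["A", "B", "A", "B"]

scores = accuracy_by_indexical(items, predictions)
print(scores)
```

Grouping by indexical rather than reporting a single aggregate score is what makes contrasts such as I versus here visible in the first place.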