CL · AI · Jan 10, 2024

I am a Strange Dataset: Metalinguistic Tests for Language Models

arXiv:2401.05300v2 · 33 citations · h-index: 23 · Has Code · ACL
Originality: Incremental advance
AI Analysis

This work addresses a fundamental limitation of language models on tasks involving self-referential language. The contribution is incremental, as it builds on existing evaluation methods.

The paper asks whether large language models can handle metalinguistic self-reference, as in statements like 'This paper has six sections.' It finds that most models perform close to chance: GPT-4 reaches only around 60% accuracy, compared with human scores of 89-93%.

Statements involving metalinguistic self-reference ("This paper has six sections.") are prevalent in many domains. Can current large language models (LLMs) handle such language? In this paper, we present "I am a Strange Dataset", a new dataset for addressing this question. There are two subtasks: generation and verification. In generation, models continue statements like "The penultimate word in this sentence is" (where a correct continuation is "is"). In verification, models judge the truth of statements like "The penultimate word in this sentence is sentence." (false). We also provide minimally different metalinguistic non-self-reference examples to complement the main dataset by probing for whether models can handle metalinguistic language at all. The dataset is hand-crafted by experts and validated by non-expert annotators. We test a variety of open-source LLMs (7B to 70B parameters) as well as closed-source LLMs through APIs. All models perform close to chance across both subtasks and even on the non-self-referential metalinguistic control data, though we find some steady improvement with model scale. GPT 4 is the only model to consistently do significantly better than chance, and it is still only in the 60% range, while our untrained human annotators score well in the 89-93% range. The dataset and evaluation toolkit are available at https://github.com/TristanThrush/i-am-a-strange-dataset.
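The two subtasks can be illustrated with the abstract's own example. The following is a minimal sketch, not the dataset's evaluation toolkit: the function names (`words`, `penultimate_claim_is_true`, `continuation_is_correct`) are hypothetical, and the check covers only the single template "The penultimate word in this sentence is X." under the assumption that words are runs of letters, digits, and apostrophes.

```python
import re


def words(sentence: str) -> list[str]:
    """Split a sentence into lowercase word tokens, dropping punctuation.

    Assumption: a word is a run of letters, digits, or apostrophes.
    """
    return re.findall(r"[a-z0-9']+", sentence.lower())


def penultimate_claim_is_true(sentence: str) -> bool:
    """Verification: judge 'The penultimate word in this sentence is X.'

    The claim is true when X (the last word) equals the sentence's
    actual second-to-last word.
    """
    tokens = words(sentence)
    if len(tokens) < 2:
        return False
    claimed, actual = tokens[-1], tokens[-2]
    return claimed == actual


def continuation_is_correct(prefix: str, continuation: str) -> bool:
    """Generation: a continuation is correct if the completed sentence is true."""
    return penultimate_claim_is_true(f"{prefix} {continuation}.")


prefix = "The penultimate word in this sentence is"
print(continuation_is_correct(prefix, "is"))        # correct continuation
print(continuation_is_correct(prefix, "sentence"))  # incorrect continuation
```

The point of the sketch is that the truth value depends on the completed sentence itself: appending a continuation shifts which word is penultimate, so a model cannot answer by inspecting the prefix alone.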

Code Implementations: 1 repo
