CL | Mar 26, 2024

Constructions Are So Difficult That Even Large Language Models Get Them Right for the Wrong Reasons

CMU
arXiv:2403.17760v2 | 86 citations | h-index: 18 | LREC
Originality: Incremental advance
AI Analysis

This work highlights a specific failure mode in LLMs' linguistic understanding. The advance is incremental, but it matters to NLP researchers.

The authors introduce a challenge dataset for natural language inference (NLI) with high lexical overlap to test large language models. They find that GPT-4 and Llama 2 fail it with strong bias and cannot distinguish between adjective constructions based on meaning.

In this paper, we make a contribution that can be understood from two perspectives: from an NLP perspective, we introduce a small challenge dataset for NLI with large lexical overlap, which minimises the possibility of models discerning entailment solely based on token distinctions, and show that GPT-4 and Llama 2 fail it with strong bias. We then create further challenging sub-tasks in an effort to explain this failure. From a Computational Linguistics perspective, we identify a group of constructions with three classes of adjectives which cannot be distinguished by surface features. This enables us to probe for LLMs' understanding of these constructions in various ways, and we find that they fail in a variety of ways to distinguish between them, suggesting that they don't adequately represent their meaning or capture the lexical properties of phrasal heads.
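
To make the two measurable ideas in the abstract concrete, the sketch below shows one possible way to quantify lexical overlap between a premise and a hypothesis, and to summarise how skewed a model's predicted labels are. It is an illustration only, not the authors' method: the function names (`hypothesis_coverage`, `label_distribution`), the whitespace tokenisation, and the example sentences are all assumptions made here and are not taken from the paper or its dataset.

```python
from collections import Counter
from typing import Dict, List, Set


def tokenize(sentence: str) -> Set[str]:
    """Lowercase, split on whitespace, and strip surrounding punctuation.

    A deliberately crude tokeniser (an assumption for this sketch).
    """
    tokens = {tok.strip(".,;:!?\"'").lower() for tok in sentence.split()}
    tokens.discard("")
    return tokens


def hypothesis_coverage(premise: str, hypothesis: str) -> float:
    """Fraction of distinct hypothesis tokens that also occur in the premise.

    A value of 1.0 means the hypothesis introduces no new tokens, so a model
    cannot rely on token differences alone to decide entailment.
    """
    p, h = tokenize(premise), tokenize(hypothesis)
    if not h:
        return 0.0
    return len(p & h) / len(h)


def label_distribution(predictions: List[str]) -> Dict[str, float]:
    """Share of each predicted label; a heavily skewed share suggests bias."""
    counts = Counter(predictions)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Hypothetical sentence pair, invented for illustration (not from the dataset).
premise = "I was so happy that I cried."
hypothesis = "I cried."
coverage = hypothesis_coverage(premise, hypothesis)

# Hypothetical model outputs, invented for illustration.
shares = label_distribution(["entailment", "entailment", "entailment", "neutral"])
```

Coverage of the hypothesis by the premise is only one of several reasonable overlap measures. Jaccard similarity over the two token sets would be an alternative, but it penalises short hypotheses even when every one of their tokens appears in the premise.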

Code Implementations: 1 repo
