CL | AI | Jan 14

Can LLMs interpret figurative language as humans do?: surface-level vs representational similarity

arXiv:2601.09041v1 | 1 citation | h-index: 1
Originality: Incremental advance
AI Analysis

This paper addresses the problem of evaluating LLM-human alignment in nuanced language interpretation, a question relevant to AI and linguistics researchers. The advance is incremental because it builds on existing human-model comparisons.

The study investigated whether large language models (LLMs) align with human judgments when interpreting figurative and socially grounded language. It found that the models resemble humans at the surface level but diverge significantly at the representational level, especially for idioms and Gen Z slang, with GPT-4 coming closest to humans.

Large language models generate judgments that resemble those of humans. Yet the extent to which these models align with human judgments in interpreting figurative and socially grounded language remains uncertain. To investigate this, human participants and four instruction-tuned LLMs of different sizes (GPT-4, Gemma-2-9B, Llama-3.2, and Mistral-7B) rated 240 dialogue-based sentences representing six linguistic traits: conventionality, sarcasm, funny, emotional, idiomacy, and slang. Each of the 240 sentences was paired with 40 interpretive questions, and both humans and LLMs rated these sentences on a 10-point Likert scale. Results indicated that LLMs aligned with humans at the surface level but diverged significantly at the representational level, especially in interpreting figurative sentences involving idioms and Gen Z slang. GPT-4 most closely approximated human representational patterns, while all models struggled with context-dependent and socio-pragmatic expressions such as sarcasm, slang, and idiomacy.
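
The abstract does not say how the two levels of similarity were computed, so the following is only a minimal sketch of one common way to separate them. It assumes that "surface-level" means correlating per-sentence mean ratings, and that "representational" means an RSA-style comparison of the pairwise distances between sentence rating profiles. The function names and the toy ratings are hypothetical and are not taken from the paper.

```python
from itertools import combinations
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def surface_similarity(human, model):
    """Assumed surface measure: correlate per-sentence mean ratings."""
    return pearson([mean(row) for row in human], [mean(row) for row in model])


def dissimilarities(ratings):
    """Euclidean distance between every pair of sentence rating profiles."""
    return [
        sum((a - b) ** 2 for a, b in zip(ratings[i], ratings[j])) ** 0.5
        for i, j in combinations(range(len(ratings)), 2)
    ]


def representational_similarity(human, model):
    """Assumed representational measure: correlate the two distance structures."""
    return pearson(dissimilarities(human), dissimilarities(model))


# Toy data (invented): rows are sentences, columns are interpretive questions,
# values are ratings on a 10-point scale.
human = [[1, 9], [9, 1], [3, 3], [7, 7]]
model = [[5, 5], [5, 5], [1, 5], [9, 5]]

surface = surface_similarity(human, model)
representational = representational_similarity(human, model)
print(f"surface: {surface:.2f}, representational: {representational:.2f}")
```

In this toy case the per-sentence means of the two raters are identical, so the surface measure is at its maximum, while the distance structures are negatively correlated. That is the kind of gap the abstract describes: agreement on average ratings does not guarantee that sentences are organized in the same way relative to one another.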

