CL | LG | May 31, 2023

Large Language Models Are Not Strong Abstract Reasoners

arXiv:2305.19555v3 | 71 citations | Has Code
Originality: Incremental advance
AI Analysis

This work addresses a fundamental limitation in AI: LLMs struggle with abstract reasoning, which is crucial for achieving human-like cognitive capabilities. The contribution is incremental, as it builds on existing evaluation methods.

The authors evaluate large language models (LLMs) on abstract reasoning tasks. They introduce a new benchmark and show that state-of-the-art LLMs achieve very limited performance, often below 50% accuracy, in contrast to their success on other NLP tasks.

Large Language Models have shown tremendous performance on a large variety of natural language processing tasks, ranging from text comprehension to common sense reasoning. However, the mechanisms responsible for this success remain opaque, and it is unclear whether LLMs can achieve human-like cognitive capabilities or whether these models are still fundamentally circumscribed. Abstract reasoning is a fundamental task for cognition, consisting of finding and applying a general pattern from few data. Evaluating deep neural architectures on this task could give insight into their potential limitations regarding reasoning and their broad generalisation abilities, yet this is currently an under-explored area. In this paper, we introduce a new benchmark for evaluating language models beyond memorization on abstract reasoning tasks. We perform extensive evaluations of state-of-the-art LLMs, showing that they currently achieve very limited performance in contrast with other natural language tasks, even when applying techniques that have been shown to improve performance on other NLP tasks. We argue that guiding LLM generation to follow causal paths could help improve the generalisation and reasoning abilities of LLMs.

Code Implementations: 1 repo
