CL · AI · Apr 30, 2024

Can Large Language Models put 2 and 2 together? Probing for Entailed Arithmetical Relationships

arXiv:2404.19432v13.46 citations · h-index: 2 · NeSy
Originality: Incremental advance
AI Analysis

This work addresses a question that matters to AI researchers: how to distinguish statistical inference from genuine reasoning in LLMs. It highlights a core limitation of current models.

The paper investigates whether large language models (LLMs) can reason about implicitly held knowledge by probing their ability to compare cardinalities, such as the number of legs on a bird versus the number of wheels on a tricycle. It finds that LLMs rely on statistical inference rather than genuine reasoning: performance improves incrementally across GPT releases but remains limited.

Two major areas of interest in the era of Large Language Models concern the questions of what LLMs know, and whether and how they may be able to reason (or rather, approximately reason). Since to date these lines of work have progressed largely in parallel (with notable exceptions), we are interested in investigating the intersection: probing for reasoning about implicitly held knowledge. Suspecting the performance to be lacking in this area, we use a very simple set-up of comparisons between cardinalities associated with elements of various subjects (e.g. the number of legs a bird has versus the number of wheels on a tricycle). We empirically demonstrate that although LLMs make steady progress in knowledge acquisition and (pseudo)reasoning with each new GPT release, their capabilities are limited to statistical inference only. It is difficult to argue that pure statistical learning can cope with the combinatorial explosion inherent in many commonsense reasoning tasks, especially once arithmetical notions are involved. Further, we argue that bigger is not always better and chasing purely statistical improvements is flawed at the core, since it only exacerbates the dangerous conflation of the production of correct answers with genuine reasoning ability.
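The comparison set-up described in the abstract can be sketched roughly as follows. This is only an illustration, not the authors' code or data: the `CARDINALITIES` table, the prompt wording, and the function names `relation` and `build_probes` are all assumptions introduced here. The sketch pairs every (subject, element) fact with every other one and derives the entailed relationship from the stored counts.

```python
from itertools import permutations

# Hypothetical mini knowledge base: (subject, element) -> cardinality.
# These entries are illustrative and are not taken from the paper's dataset.
CARDINALITIES = {
    ("bird", "legs"): 2,
    ("tricycle", "wheels"): 3,
    ("spider", "legs"): 8,
    ("car", "wheels"): 4,
}


def relation(a: int, b: int) -> str:
    """Return the entailed relationship between two cardinalities."""
    if a < b:
        return "fewer"
    if a > b:
        return "more"
    return "equal"


def build_probes(kb: dict) -> list:
    """Build one comparison probe per ordered pair of distinct facts."""
    probes = []
    for ((s1, e1), n1), ((s2, e2), n2) in permutations(kb.items(), 2):
        # Assumed prompt template; the paper's actual wording may differ.
        question = (
            f"Compare the number of {e1} on a {s1} with the number of "
            f"{e2} on a {s2}: fewer, more, or equal?"
        )
        probes.append({"question": question, "answer": relation(n1, n2)})
    return probes


if __name__ == "__main__":
    probes = build_probes(CARDINALITIES)
    # n facts yield n * (n - 1) ordered comparisons.
    print(len(probes))
    print(probes[0]["question"], "->", probes[0]["answer"])
```

Because n stored facts yield n * (n - 1) ordered comparisons, the number of probes grows quadratically with the size of the knowledge base, which gives a small-scale sense of the combinatorial growth the abstract refers to.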

