CL | Oct 26, 2024

Reasoning or a Semblance of it? A Diagnostic Study of Transitive Reasoning in LLMs

arXiv:2410.20200v1 | 24 citations | h-index: 20 | EMNLP
Originality: Incremental advance
AI Analysis

This work examines whether LLMs actually reason, a question of direct interest to AI researchers. It is an incremental advance: it builds on existing diagnostic methods without a major breakthrough.

The study investigates whether LLaMA 2 and Flan-T5 perform genuine transitive reasoning or rely on implicit cues. Both models exploit word/phrase overlaps, but Flan-T5 is more resilient to manipulations of prior knowledge and named entities, showing less variance than LLaMA 2.

Evaluating Large Language Models (LLMs) on reasoning benchmarks demonstrates their ability to solve compositional questions. However, little is known about whether these models engage in genuine logical reasoning or simply rely on implicit cues to generate answers. In this paper, we investigate the transitive reasoning capabilities of two distinct LLM architectures, LLaMA 2 and Flan-T5, by manipulating facts within two compositional datasets: QASC and Bamboogle. We controlled for potential cues that might influence the models' performance, including (a) word/phrase overlaps across sections of test input; (b) models' inherent knowledge during pre-training or fine-tuning; and (c) Named Entities. Our findings reveal that while both models leverage (a), Flan-T5 shows more resilience to experiments (b) and (c), having less variance than LLaMA 2. This suggests that models may develop an understanding of transitivity through fine-tuning on knowingly relevant datasets, a hypothesis we leave to future work.
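The kind of cue control described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code or protocol: the function names, the stopword list, the nonsense placeholder "blicket", and the two example facts are all hypothetical. It measures lexical overlap between two facts, finds the shared "bridge" term that links them (A relates to B, B relates to C), and swaps that term for a nonsense word so the transitive structure survives while any familiar entity is removed.

```python
import re

# Hypothetical, minimal stopword list; a real study would use a fuller one.
STOPWORDS = frozenset({"a", "an", "the", "is", "are", "of", "to", "in", "by", "can"})


def tokens(text):
    """Return the set of lowercased alphanumeric word tokens in text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap(a, b):
    """Jaccard overlap between the token sets of two strings."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def bridge_tokens(fact1, fact2):
    """Content words shared by both facts, i.e. the candidate bridge terms."""
    return (tokens(fact1) & tokens(fact2)) - STOPWORDS


def mask_bridge(fact1, fact2, placeholder="blicket"):
    """Replace every bridge term in both facts with the same nonsense word.

    The two facts still chain through the placeholder, but the model can no
    longer lean on what it already knows about the original bridge term.
    """
    bridge = bridge_tokens(fact1, fact2)
    if not bridge:
        return fact1, fact2
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in sorted(bridge)) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(placeholder, fact1), pattern.sub(placeholder, fact2)


if __name__ == "__main__":
    # Made-up facts for illustration; not taken from QASC or Bamboogle.
    f1, f2 = "Beavers build dams", "Dams block rivers"
    print(bridge_tokens(f1, f2))
    print(overlap(f1, f2))
    print(mask_bridge(f1, f2))
```

Comparing a model's accuracy on the original facts and on the masked facts gives a rough sense of how much it depends on prior knowledge of the bridge term rather than on the chain itself.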

