CL, AI | Jul 1, 2024

Deciphering the Factors Influencing the Efficacy of Chain-of-Thought: Probability, Memorization, and Noisy Reasoning

Akshara Prabhakar, Thomas L. Griffiths, R. Thomas McCoy

Princeton

arXiv:2407.01687v2 | 19.237 citations | h-index: 22 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the debate over whether CoT prompting enables abstract generalization in LLMs or relies on shallow heuristics. It offers insights for researchers and practitioners in AI and natural language processing, though the advance is incremental because it builds on existing CoT research.

The study investigated factors affecting Chain-of-Thought (CoT) prompting in large language models on the symbolic reasoning task of decoding shift ciphers. It found that output probability, memorization, and noisy reasoning significantly influence accuracy, with GPT-4 accuracy varying from 26% to 70% depending on the probability of the expected output.

Chain-of-Thought (CoT) prompting has been shown to enhance the multi-step reasoning capabilities of Large Language Models (LLMs). However, debates persist about whether LLMs exhibit abstract generalization or rely on shallow heuristics when given CoT prompts. To understand the factors influencing CoT reasoning, we provide a detailed case study of the symbolic reasoning task of decoding shift ciphers, where letters are shifted forward some number of steps in the alphabet. We analyze the pattern of results produced by three LLMs (GPT-4, Claude 3, and Llama 3.1) performing this task using CoT prompting. By focusing on a single relatively simple task, we are able to identify three factors that systematically affect CoT performance: the probability of the task's expected output (probability), what the model has implicitly learned during pre-training (memorization), and the number of intermediate operations involved in reasoning (noisy reasoning). We show that these factors can drastically influence task accuracy across all three LLMs; e.g., when tested with GPT-4, varying the output's probability of occurrence shifts accuracy from 26% to 70%. Overall, we conclude that CoT prompting performance reflects both memorization and a probabilistic version of genuine reasoning. Code and data are available at https://github.com/aksh555/deciphering_cot
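For readers unfamiliar with the task, the following is a minimal sketch of a shift cipher as the abstract describes it: encoding moves each letter forward a fixed number of steps in the alphabet, and decoding moves it back. The function names (`shift_char`, `encode`, `decode`) are illustrative assumptions and are not taken from the authors' repository.

```python
def shift_char(ch: str, shift: int) -> str:
    """Shift one ASCII letter by `shift` positions, wrapping around the alphabet.

    Non-letter characters (spaces, punctuation, digits) are returned unchanged.
    """
    if ch.isascii() and ch.isalpha():
        base = ord("a") if ch.islower() else ord("A")
        return chr((ord(ch) - base + shift) % 26 + base)
    return ch


def encode(text: str, shift: int) -> str:
    """Encode text by shifting every letter forward by `shift` steps."""
    return "".join(shift_char(ch, shift) for ch in text)


def decode(text: str, shift: int) -> str:
    """Decode text by shifting every letter backward by `shift` steps."""
    return encode(text, -shift)


if __name__ == "__main__":
    ciphertext = encode("stay here", 13)
    print(ciphertext)              # fgnl urer
    print(decode(ciphertext, 13))  # stay here
```

A shift of 13 is the familiar rot-13 scheme. Other shift values use the same procedure with a different offset, which is how a task like this can be varied while keeping the underlying operation fixed.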
