CL · May 23, 2023

Two Failures of Self-Consistency in the Multi-Step Reasoning of LLMs

Angelica Chen, Jason Phang, Alicia Parrish, Vishakh Padmakumar, Chen Zhao, Samuel R. Bowman, Kyunghyun Cho

arXiv:2305.14279v4 · 12.154 citations

Originality: Synthesis-oriented

AI Analysis

This work addresses the problem of ensuring valid reasoning in AI systems on tasks that require multi-step solutions. Its contribution is incremental: it focuses on evaluation rather than on a new method.

The paper evaluates self-consistency in large language models on multi-step reasoning tasks and shows that GPT-3 and GPT-4 variants score poorly on both hypothetical and compositional consistency.

Large language models (LLMs) have achieved widespread success on a variety of in-context few-shot tasks, but this success is typically evaluated via correctness rather than consistency. We argue that self-consistency is an important criterion for valid multi-step reasoning in tasks where the solution is composed of the answers to multiple sub-steps. We propose two types of self-consistency that are particularly important for multi-step reasoning -- hypothetical consistency (a model's ability to predict what its output would be in a hypothetical other context) and compositional consistency (consistency of a model's final outputs when intermediate sub-steps are replaced with the model's outputs for those steps). We demonstrate that multiple variants of the GPT-3/-4 models exhibit poor consistency rates across both types of consistency on a variety of tasks.
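To make the two definitions concrete, here is a minimal sketch of how each consistency rate might be computed. It is an illustration only, not the authors' evaluation protocol: the `Model` interface, the `HYPOTHETICAL_TEMPLATE` wording, the exact-match comparison, and the lookup-table `toy_model` are all assumptions introduced here.

```python
from typing import Callable, List, Tuple

# Hypothetical interface (an assumption): a model maps a prompt to a completion.
Model = Callable[[str], str]

# Hypothetical wording used to ask the model about its own output.
HYPOTHETICAL_TEMPLATE = "What would you answer if asked: {}"


def hypothetical_consistency(model: Model, prompts: List[str]) -> float:
    """Fraction of prompts where the model's prediction of its own
    output matches the output it actually produces."""
    hits = 0
    for prompt in prompts:
        actual = model(prompt)
        predicted = model(HYPOTHETICAL_TEMPLATE.format(prompt))
        hits += predicted == actual
    return hits / len(prompts)


def compositional_consistency(model: Model, cases: List[Tuple[str, str]]) -> float:
    """Each case is (full_prompt, sub_step), with sub_step a substring of
    full_prompt. The case counts as consistent when the final answer is
    unchanged after sub_step is replaced by the model's own answer to it."""
    hits = 0
    for full_prompt, sub_step in cases:
        sub_answer = model(sub_step)
        substituted = full_prompt.replace(sub_step, sub_answer, 1)
        hits += model(full_prompt) == model(substituted)
    return hits / len(cases)


# Toy stand-in for an LLM: a fixed lookup table with deliberate inconsistencies.
TOY_ANSWERS = {
    "(2 + 3)": "5",
    "(1 + 1)": "2",
    "(2 + 3) * 4": "20",
    "5 * 4": "21",  # disagrees with "(2 + 3) * 4" -> compositional failure
    "(1 + 1) * 3": "6",
    "2 * 3": "6",
    HYPOTHETICAL_TEMPLATE.format("(2 + 3)"): "5",
    HYPOTHETICAL_TEMPLATE.format("(1 + 1)"): "3",  # wrong self-prediction
}


def toy_model(prompt: str) -> str:
    return TOY_ANSWERS.get(prompt, "?")


hyp_rate = hypothetical_consistency(toy_model, ["(2 + 3)", "(1 + 1)"])
comp_rate = compositional_consistency(
    toy_model, [("(2 + 3) * 4", "(2 + 3)"), ("(1 + 1) * 3", "(1 + 1)")]
)
```

Note that neither function checks correctness: a model could answer every question wrongly and still score a perfect consistency rate, which is why the abstract treats consistency as a separate criterion.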
