CL | Dec 20, 2022

True Detective: A Deep Abductive Reasoning Benchmark Undoable for GPT-3 and Challenging for GPT-4

arXiv:2212.10114v222.2233 citations | h-index: 22

Originality: Incremental advance

AI Analysis

The paper provides a challenging benchmark for future research on reasoning in language models and highlights a significant gap between LLMs and humans, though the advance is incremental because it builds on existing evaluation frameworks.

The authors tackled the problem of evaluating deep reasoning in large language models by introducing a benchmark of 191 long-form mystery puzzles, where GPT-3 achieved 28% accuracy and GPT-4 only 38%, compared to 47% for average humans.

Large language models (LLMs) have demonstrated solid zero-shot reasoning capabilities, which is reflected in their performance on the current test tasks. This calls for a more challenging benchmark requiring highly advanced reasoning ability to be solved. In this paper, we introduce such a benchmark, consisting of 191 long-form (1200 words on average) mystery narratives constructed as detective puzzles. Puzzles are sourced from the "5 Minute Mystery" platform and include a multiple-choice question for evaluation. Only 47% of humans solve a puzzle successfully on average, while the best human solvers achieve over 80% success rate. We show that GPT-3 models barely outperform random on this benchmark (with 28% accuracy) while state-of-the-art GPT-4 solves only 38% of puzzles. This indicates that there is still a significant gap in the deep reasoning abilities of LLMs and humans and highlights the need for further research in this area. Our work introduces a challenging benchmark for future studies on reasoning in language models and contributes to a better understanding of the limits of LLMs' abilities.
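The abstract reports multiple-choice accuracy and compares it with random guessing. A minimal sketch of how such scores could be computed is shown here. The `Puzzle` schema, the function names, and the toy data are illustrative assumptions; they are not the authors' code or the dataset's actual format, and the number of answer options per puzzle is not stated in the abstract.

```python
from dataclasses import dataclass


@dataclass
class Puzzle:
    """One multiple-choice mystery (hypothetical schema, not the dataset's)."""
    options: list[str]  # candidate answers, e.g. suspect names
    answer: str         # the correct option


def accuracy(puzzles: list[Puzzle], predictions: list[str]) -> float:
    """Fraction of puzzles whose predicted option equals the correct answer."""
    if not puzzles:
        raise ValueError("no puzzles to score")
    if len(puzzles) != len(predictions):
        raise ValueError("need exactly one prediction per puzzle")
    correct = sum(p.answer == pred for p, pred in zip(puzzles, predictions))
    return correct / len(puzzles)


def random_baseline(puzzles: list[Puzzle]) -> float:
    """Expected accuracy of uniform random guessing: mean of 1 / #options."""
    if not puzzles:
        raise ValueError("no puzzles to score")
    return sum(1 / len(p.options) for p in puzzles) / len(puzzles)


# Toy data (made up for illustration; not taken from the benchmark).
toy = [
    Puzzle(options=["Ann", "Bob", "Cy", "Dee"], answer="Bob"),
    Puzzle(options=["Ann", "Bob", "Cy", "Dee"], answer="Cy"),
]
toy_predictions = ["Bob", "Ann"]  # first guess right, second wrong

print(f"accuracy: {accuracy(toy, toy_predictions):.2f}")
print(f"random baseline: {random_baseline(toy):.2f}")
```

Reporting the random baseline next to the model score makes a claim such as "barely outperforms random" checkable, because the baseline depends on how many options each puzzle offers.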
