CL, AI | Feb 11, 2025

WHODUNIT: Evaluation benchmark for culprit detection in mystery stories

arXiv:2502.07747v1 | 2 citations | h-index: 2
Originality: Synthesis-oriented
AI Analysis

This work addresses AI researchers' need for robust evaluation benchmarks in narrative reasoning. Its contribution is incremental: it focuses on dataset creation and on testing existing models.

The authors tackled the problem of evaluating deductive reasoning in large language models using a new dataset from mystery stories, finding that while models perform reliably on unaltered texts, accuracy drops with certain name substitutions, particularly for widely recognized entities.

We present a novel dataset, WhoDunIt, to assess the deductive reasoning capabilities of large language models (LLMs) within narrative contexts. Constructed from open-domain mystery novels and short stories, the dataset challenges LLMs to identify the perpetrator after reading and comprehending the story. To evaluate model robustness, we apply a range of character-level name augmentations, including original names, name swaps, and substitutions with well-known real and/or fictional entities from popular discourse. We further use various prompting styles to investigate the influence of prompting on deductive reasoning accuracy. We conduct an evaluation study with state-of-the-art models, specifically GPT-4o, GPT-4-turbo, and GPT-4o-mini, evaluated through multiple trials with majority response selection to ensure reliability. The results demonstrate that while LLMs perform reliably on unaltered texts, accuracy diminishes with certain name substitutions, particularly those with wide recognition. The dataset is publicly available.
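The two mechanisms named in the abstract, name augmentation and majority response selection, can be illustrated with a minimal sketch. This is not the authors' code: the function names `substitute_names` and `majority_answer`, the whole-word matching rule, and the lowercase normalization of answers are all assumptions made for illustration.

```python
import re
from collections import Counter


def substitute_names(text, mapping):
    """Replace character names in one regex pass (hypothetical helper).

    A single pass means a swap such as Alice->Bob, Bob->Alice does not
    overwrite its own replacements.
    """
    if not mapping:
        return text
    # Try longer names first so "Mary Ann" is matched before "Mary".
    names = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in names) + r")\b"
    )
    return pattern.sub(lambda match: mapping[match.group(1)], text)


def majority_answer(responses):
    """Return the most frequent answer across trials (hypothetical helper).

    Answers are stripped and lowercased before counting (an assumption);
    empty answers are ignored, and None is returned if nothing remains.
    """
    normalized = [r.strip().lower() for r in responses if r.strip()]
    if not normalized:
        return None
    return Counter(normalized).most_common(1)[0][0]


story = "Alice found the knife. Bob hid the letter."
swapped = substitute_names(story, {"Alice": "Bob", "Bob": "Alice"})
verdict = majority_answer(["Bob", "bob ", "Alice"])
```

Here `swapped` holds the story with the two names exchanged, and `verdict` is the answer given in two of the three simulated trials. In a real evaluation the list passed to `majority_answer` would come from repeated model calls on the same augmented story.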

Code Implementations: 1 repo
