CL Sep 4, 2024

DetectiveQA: Evaluating Long-Context Reasoning on Detective Novels

Zhe Xu, Jiasheng Ye, Xiaoran Liu, Xiangyang Liu, Tianxiang Sun, Zhigeng Liu, Qipeng Guo, Linlin Li, Qun Liu, Xuanjing Huang, Xipeng Qiu

arXiv:2409.02465v2 10.414 citations h-index: 25

Originality: Synthesis-oriented

AI Analysis

DetectiveQA gives researchers studying long-context reasoning in LLMs a domain-specific benchmark, though the contribution is incremental because it builds on existing evaluation efforts.

The authors tackled the problem of evaluating long-context reasoning in LLMs by creating DetectiveQA, a dataset of 1200 human-annotated questions from detective novels averaging over 100k tokens, and found persistent challenges in evidence retrieval for models like GPT-4, Claude, and LLaMA.

Recently, significant efforts have been devoted to enhancing the long-context capabilities of Large Language Models (LLMs), particularly in long-context reasoning. To facilitate this research, we propose DetectiveQA, a dataset specifically designed for narrative reasoning within long contexts. We leverage detective novels, averaging over 100k tokens, to create a dataset containing 1200 human-annotated questions in both Chinese and English, each paired with corresponding reference reasoning steps. Furthermore, we introduce a step-wise reasoning metric, which enhances the evaluation of LLMs' reasoning processes. We validate our approach and evaluate mainstream LLMs, including GPT-4, Claude, and LLaMA, revealing persistent long-context reasoning challenges and difficulties with evidence retrieval. Our findings offer valuable insights into the study of long-context reasoning and lay the foundation for more rigorous evaluations.
