IR | AI | CL | Oct 25, 2025

PaperAsk: A Benchmark for Reliability Evaluation of LLMs in Paper Search and Reading

arXiv:2510.22242v1 | 1 citation | h-index: 1
Originality: Incremental advance
AI Analysis

This work addresses scholars' need for reliable LLM-based research assistants by providing a diagnostic framework. The advance is incremental, since it builds on existing evaluation methods.

The paper introduces PaperAsk, a benchmark for evaluating the reliability of LLMs in scholarly tasks. It reveals consistent failures across citation retrieval, content extraction, paper discovery, and claim verification: citation retrieval fails in 48-98% of multi-reference queries, and section-specific content extraction fails in 72-91% of cases.

Large Language Models (LLMs) increasingly serve as research assistants, yet their reliability in scholarly tasks remains under-evaluated. In this work, we introduce PaperAsk, a benchmark that systematically evaluates LLMs across four key research tasks: citation retrieval, content extraction, paper discovery, and claim verification. We evaluate GPT-4o, GPT-5, and Gemini-2.5-Flash under realistic usage conditions, via web interfaces where search operations are opaque to the user. Through controlled experiments, we find consistent reliability failures: citation retrieval fails in 48-98% of multi-reference queries, section-specific content extraction fails in 72-91% of cases, and topical paper discovery yields F1 scores below 0.32, missing over 60% of relevant literature. Further human analysis attributes these failures to the uncontrolled expansion of retrieved context and the tendency of LLMs to prioritize semantically relevant text over task instructions. Across basic tasks, the LLMs display distinct failure behaviors: ChatGPT often withholds responses rather than risk errors, whereas Gemini produces fluent but fabricated answers. To address these issues, we develop lightweight reliability classifiers trained on PaperAsk data to identify unreliable outputs. PaperAsk provides a reproducible and diagnostic framework for advancing the reliability evaluation of LLM-based scholarly assistance systems.
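
The abstract reports paper discovery in terms of F1 and the share of relevant literature missed. As a rough illustration of how those two quantities relate, the sketch below computes set-based precision, recall, and F1 over paper identifiers. The function name `discovery_scores`, the exact-match comparison of identifiers, and the sample IDs are all assumptions made for this example; the abstract does not state how the paper matches retrieved papers to the ground truth.

```python
def discovery_scores(retrieved, relevant):
    """Set-based precision, recall, and F1 over paper identifiers.

    Assumes exact identifier matching, which may differ from the
    matching rule used in the PaperAsk benchmark itself.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    # F1 is the harmonic mean of precision and recall; define it as 0 with no hits.
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical example: 10 relevant papers, 10 returned, 3 in common.
relevant_ids = [f"paper-{i}" for i in range(10)]        # paper-0 .. paper-9
retrieved_ids = [f"paper-{i}" for i in range(7, 17)]    # paper-7 .. paper-16

precision, recall, f1 = discovery_scores(retrieved_ids, relevant_ids)
missed_share = 1.0 - recall  # fraction of relevant papers not returned
print(round(precision, 2), round(recall, 2), round(f1, 2), round(missed_share, 2))
```

In this made-up case recall is 0.3, so 70% of the relevant papers are missed. "Missing over 60% of relevant literature" corresponds to recall below 0.4 under this definition.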

