CL AI May 26

Beyond Questions: Evaluating What Large Language Models (Actually) Know

arXiv:2605.26937

AI Analysis

For researchers and practitioners evaluating LLM knowledge, this work addresses the availability bias in existing benchmarks by shifting the focus to the knowledge that models naturally express.

The paper introduces open knowledge evaluation, a new paradigm for benchmarking LLM knowledge that uses open-ended prompts instead of predefined questions, and presents BeQu, a benchmark of 10,000 entities. Results show that this approach reveals different knowledge characteristics than traditional QA benchmarks.

Parametric knowledge in large language models (LLMs) is a cornerstone of their success, yet it remains poorly understood. Existing knowledge benchmarks typically rely on predefined questions (e.g., "What is the birth date of M.L. King?"), so they evaluate only the knowledge that benchmark designers explicitly choose to query, which introduces a problematic availability bias. In this paper, we introduce open knowledge evaluation, a new paradigm for LLM knowledge benchmarking. Instead of asking narrow questions, it evaluates models on the knowledge they choose to surface in response to open-ended elicitation prompts (e.g., "Tell me everything you know about M.L. King"). This shifts the focus from predefined answer retrieval toward characterizing the knowledge models naturally express. We instantiate this paradigm with BeQu (Beyond Questions), a benchmark of 10,000 entities paired with reference corpora for statement verification. Using BeQu, we evaluate a broad range of language models and analyze the effects of reasoning effort, model scale, prompt format, and knowledge domain. Data and leaderboard are available on this work's GitHub repository and at the benchmark's website.
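
To make the evaluation flow described in the abstract concrete, the following is a minimal sketch of one possible loop: build an open-ended prompt for an entity, split the model's response into statements, and check each statement against a reference corpus. Everything here is an assumption for illustration and not the BeQu implementation. The function names, the one-statement-per-line response format, and the token-overlap check (a crude stand-in for whatever statement verification the benchmark actually uses) are all hypothetical. Only the prompt wording is taken from the abstract.

```python
import re
from dataclasses import dataclass


@dataclass
class OpenKnowledgeScore:
    """Hypothetical summary of one response; not a BeQu metric."""
    total_statements: int
    supported_statements: int

    @property
    def precision(self) -> float:
        # Share of surfaced statements that the reference corpus supports.
        if self.total_statements == 0:
            return 0.0
        return self.supported_statements / self.total_statements


def build_elicitation_prompt(entity: str) -> str:
    # Prompt wording follows the example given in the abstract.
    return f"Tell me everything you know about {entity}"


def split_statements(response: str) -> list[str]:
    # Assumption: the response lists one statement per line.
    return [line.strip() for line in response.splitlines() if line.strip()]


def tokens(text: str) -> set[str]:
    # Lowercased alphanumeric tokens, used only by the toy verifier below.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def is_supported(statement: str, reference_corpus: list[str],
                 threshold: float = 0.9) -> bool:
    # Toy verifier (assumption): a statement counts as supported when at
    # least `threshold` of its tokens appear in a single reference passage.
    # Token overlap cannot detect entailment; it only illustrates the step.
    statement_tokens = tokens(statement)
    if not statement_tokens:
        return False
    return any(
        len(statement_tokens & tokens(passage)) / len(statement_tokens) >= threshold
        for passage in reference_corpus
    )


def score_response(response: str, reference_corpus: list[str]) -> OpenKnowledgeScore:
    # Verify every surfaced statement and aggregate the counts.
    statements = split_statements(response)
    supported = sum(is_supported(s, reference_corpus) for s in statements)
    return OpenKnowledgeScore(len(statements), supported)
```

In this sketch, the model call itself is left out: `score_response` would receive whatever text a model returns for `build_elicitation_prompt(entity)`. Note the contrast with QA-style benchmarks: the set of statements being scored comes from the model's response, not from a fixed list of questions.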
