SimpleQA Verified: A Reliable Factuality Benchmark to Measure Parametric Knowledge
SimpleQA Verified gives the research community a more reliable tool for measuring parametric knowledge in LLMs, though the contribution is incremental because it builds on an existing benchmark.
The authors address the problem of unreliable benchmarks for evaluating LLM factuality by introducing SimpleQA Verified, a 1,000-prompt benchmark that corrects limitations of OpenAI's SimpleQA such as noisy labels, topical biases, and redundant questions. On this benchmark, Gemini 2.5 Pro achieves a state-of-the-art F1-score of 55.6.
We introduce SimpleQA Verified, a 1,000-prompt benchmark for evaluating Large Language Model (LLM) short-form factuality based on OpenAI's SimpleQA. It addresses critical limitations in OpenAI's benchmark, including noisy and incorrect labels, topical biases, and question redundancy. SimpleQA Verified was created through a rigorous multi-stage filtering process involving de-duplication, topic balancing, and source reconciliation to produce a more reliable and challenging evaluation set, alongside improvements in the autorater prompt. On this new benchmark, Gemini 2.5 Pro achieves a state-of-the-art F1-score of 55.6, outperforming other frontier models, including GPT-5. This work provides the research community with a higher-fidelity tool to track genuine progress in parametric model factuality and to mitigate hallucinations. The benchmark dataset, evaluation code, and leaderboard are available at: https://www.kaggle.com/benchmarks/deepmind/simpleqa-verified.
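To make the headline metric concrete, the sketch below shows one way an F1-score could be computed from autorater grades. It assumes the metric follows the definition used by the original SimpleQA benchmark, namely the harmonic mean of overall accuracy (correct answers over all questions) and accuracy given attempted (correct answers over questions the model actually tried to answer). The grade labels and the function name `simpleqa_f1` are hypothetical and chosen for illustration; they are not taken from the released evaluation code.

```python
from collections import Counter


def simpleqa_f1(grades):
    """Return an F1-style score from a list of per-question grades.

    Assumption: each grade is one of the hypothetical labels
    "correct", "incorrect", or "not_attempted".
    """
    counts = Counter(grades)
    total = len(grades)
    correct = counts["correct"]
    attempted = correct + counts["incorrect"]

    # With no questions, no attempts, or no correct answers the score is zero.
    if total == 0 or attempted == 0 or correct == 0:
        return 0.0

    overall_correct = correct / total              # accuracy over all questions
    correct_given_attempted = correct / attempted  # accuracy over attempted questions

    # Harmonic mean of the two accuracies.
    return (2 * overall_correct * correct_given_attempted
            / (overall_correct + correct_given_attempted))


# Made-up grades for ten questions: 6 correct, 2 incorrect, 2 not attempted.
example_grades = ["correct"] * 6 + ["incorrect"] * 2 + ["not_attempted"] * 2
example_score = simpleqa_f1(example_grades)
```

Under this definition, a model that declines to answer when unsure raises its accuracy given attempted but lowers its overall accuracy, so the harmonic mean rewards neither blind guessing nor blanket abstention.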