CL · Oct 21, 2025

KoSimpleQA: A Korean Factuality Benchmark with an Analysis of Reasoning LLMs

arXiv:2510.18368v1 · 4.91 citations · h-index: 9 · Has Code
Originality: Synthesis-oriented
AI Analysis

This work addresses the need for a Korean-specific factuality benchmark for LLMs, though it is incremental as it adapts an existing English benchmark to a new language.

The authors introduce KoSimpleQA, a benchmark of 1,000 fact-seeking questions for evaluating the factuality of large language models on Korean cultural knowledge. Even the strongest model they evaluated answered only 33.7% of the questions correctly, which shows how challenging the benchmark is.

We present $\textbf{Korean SimpleQA (KoSimpleQA)}$, a benchmark for evaluating factuality in large language models (LLMs) with a focus on Korean cultural knowledge. KoSimpleQA is designed to be challenging yet easy to grade, consisting of 1,000 short, fact-seeking questions with unambiguous answers. We conduct a comprehensive evaluation across a diverse set of open-source LLMs of varying sizes that support Korean, and find that even the strongest model generates a correct answer only 33.7% of the time, underscoring the challenging nature of KoSimpleQA. Notably, performance rankings on KoSimpleQA differ substantially from those on the English SimpleQA, highlighting the unique value of our dataset. Furthermore, our analysis of reasoning LLMs shows that engaging reasoning capabilities in the factual QA task can both help models better elicit their latent knowledge and improve their ability to abstain when uncertain. KoSimpleQA can be found at https://anonymous.4open.science/r/KoSimpleQA-62EB.
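
The abstract reports both an overall accuracy figure and the models' ability to abstain when uncertain. The following is a minimal sketch of how such numbers could be computed from per-question grades. It assumes a three-way grading scheme (correct, incorrect, not attempted), which the abstract does not spell out. The label names, the `summarize` function, and the toy grades are hypothetical and are not taken from the KoSimpleQA code or results.

```python
from collections import Counter

# Hypothetical grade labels; the abstract does not specify the grading scheme.
CORRECT = "correct"
INCORRECT = "incorrect"
NOT_ATTEMPTED = "not_attempted"


def summarize(grades):
    """Summarize per-question grades into three simple rates.

    accuracy: fraction of all questions answered correctly.
    abstention_rate: fraction of all questions the model declined to answer.
    accuracy_given_attempted: fraction correct among questions it did answer.
    """
    total = len(grades)
    if total == 0:
        raise ValueError("grades must not be empty")
    counts = Counter(grades)
    attempted = counts[CORRECT] + counts[INCORRECT]
    return {
        "accuracy": counts[CORRECT] / total,
        "abstention_rate": counts[NOT_ATTEMPTED] / total,
        "accuracy_given_attempted": (
            counts[CORRECT] / attempted if attempted else 0.0
        ),
    }


# Toy, made-up grades for ten questions (not results from the paper).
toy_grades = [CORRECT] * 3 + [INCORRECT] * 5 + [NOT_ATTEMPTED] * 2
summary = summarize(toy_grades)
print(summary)
```

Reporting accuracy on attempted questions alongside overall accuracy separates a model that knows little from one that knows when to abstain, which is the distinction the reasoning-LLM analysis draws.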

