KoSimpleQA: A Korean Factuality Benchmark with an Analysis of Reasoning LLMs
This work addresses the need for a Korean-specific factuality benchmark for LLMs, though the contribution is incremental: it adapts an existing English benchmark, SimpleQA, to a new language.
The authors address the evaluation of LLM factuality on Korean cultural knowledge by introducing KoSimpleQA, a benchmark of 1,000 fact-seeking questions. The strongest model they evaluated answered only 33.7% of the questions correctly, which shows how challenging the benchmark is.
We present $\textbf{Korean SimpleQA (KoSimpleQA)}$, a benchmark for evaluating factuality in large language models (LLMs) with a focus on Korean cultural knowledge. KoSimpleQA is designed to be challenging yet easy to grade, consisting of 1,000 short, fact-seeking questions with unambiguous answers. We conduct a comprehensive evaluation across a diverse set of open-source LLMs of varying sizes that support Korean, and find that even the strongest model generates a correct answer only 33.7% of the time, underscoring the challenging nature of KoSimpleQA. Notably, performance rankings on KoSimpleQA differ substantially from those on the English SimpleQA, highlighting the unique value of our dataset. Furthermore, our analysis of reasoning LLMs shows that engaging reasoning capabilities in the factual QA task can both help models better elicit their latent knowledge and improve their ability to abstain when uncertain. KoSimpleQA can be found at https://anonymous.4open.science/r/KoSimpleQA-62EB.
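The abstract reports a correct-answer rate and discusses abstention but does not spell out the scoring. The following is a minimal sketch, assuming KoSimpleQA reuses the three-way grading of the English SimpleQA (correct, incorrect, not attempted). The function name `summarize_grades`, the label strings, and the toy grades are hypothetical and are not taken from the KoSimpleQA release.

```python
from collections import Counter

# Assumed label set, mirroring the English SimpleQA grading scheme.
GRADES = ("correct", "incorrect", "not_attempted")


def summarize_grades(grades):
    """Aggregate per-question grades into SimpleQA-style summary metrics."""
    if not grades:
        raise ValueError("grades must not be empty")
    counts = Counter(grades)
    unknown = set(counts) - set(GRADES)
    if unknown:
        raise ValueError(f"unexpected grade labels: {sorted(unknown)}")

    total = len(grades)
    attempted = counts["correct"] + counts["incorrect"]

    # Share of all questions answered correctly (the kind of figure the
    # abstract reports as 33.7% for the strongest model).
    correct = counts["correct"] / total
    # Share of questions on which the model abstained.
    not_attempted = counts["not_attempted"] / total
    # Precision-like score: correctness among attempted questions only.
    correct_given_attempted = counts["correct"] / attempted if attempted else 0.0
    # Harmonic mean of the two correctness scores, as in SimpleQA's F-score.
    denom = correct + correct_given_attempted
    f_score = 2 * correct * correct_given_attempted / denom if denom else 0.0

    return {
        "correct": correct,
        "not_attempted": not_attempted,
        "correct_given_attempted": correct_given_attempted,
        "f_score": f_score,
    }


# Toy, made-up grades for ten questions (illustration only, not real results).
toy_grades = ["correct"] * 3 + ["incorrect"] * 5 + ["not_attempted"] * 2
toy_summary = summarize_grades(toy_grades)
```

Separating `correct` from `correct_given_attempted` is what makes abstention visible: a model that declines to answer when uncertain can raise the second score without changing the first.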