CL

Sep 7, 2025

KatotohananQA: Evaluating Truthfulness of Large Language Models in Filipino

Lorenzo Alfred Nery, Ronald Dawson Catignas, Thomas James Tiam-Lee

arXiv:2509.06065v1

h-index: 5

Originality: Synthesis-oriented

AI Analysis

This work addresses the gap in evaluating LLM truthfulness for Filipino speakers, but the contribution is incremental: it adapts an existing benchmark to a new language.

The authors address truthfulness evaluation of large language models in low-resource languages by creating KatotohananQA, a Filipino translation of the TruthfulQA benchmark. They find a significant performance gap between English and Filipino, with newer OpenAI models showing strong multilingual robustness.

Large Language Models (LLMs) achieve remarkable performance across various tasks, but their tendency to produce hallucinations limits reliable adoption. Benchmarks such as TruthfulQA have been developed to measure truthfulness, yet they are primarily available in English, leaving a gap in evaluating LLMs in low-resource languages. To address this, we present KatotohananQA, a Filipino translation of the TruthfulQA benchmark. Seven free-tier proprietary models were assessed using a binary-choice framework. Findings show a significant performance gap between English and Filipino truthfulness, with newer OpenAI models (GPT-5 and GPT-5 mini) demonstrating strong multilingual robustness. Results also reveal disparities across question characteristics, suggesting that some question types, categories, and topics are less robust to multilingual transfer, which highlights the need for broader multilingual evaluation to ensure fairness and reliability in LLM usage.
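The abstract does not spell out how the binary-choice framework is scored, so the following is only a minimal sketch of one plausible reading: each question pairs a truthful answer with a false one, accuracy is the share of items where the model picked the truthful one, and the language gap is the English accuracy minus the Filipino accuracy. The names `BinaryItem`, `accuracy`, and `language_gap` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class BinaryItem:
    """Hypothetical binary-choice item: one truthful and one false answer."""
    question: str
    truthful: str
    false: str


def accuracy(items: Sequence[BinaryItem], choices: Sequence[str]) -> float:
    """Fraction of items where the chosen answer is the truthful one."""
    if len(items) != len(choices):
        raise ValueError("items and choices must have the same length")
    if not items:
        raise ValueError("at least one item is required")
    correct = sum(
        1 for item, choice in zip(items, choices) if choice == item.truthful
    )
    return correct / len(items)


def language_gap(acc_english: float, acc_filipino: float) -> float:
    """Assumed gap definition: English accuracy minus Filipino accuracy."""
    return acc_english - acc_filipino
```

Under this assumed definition, a positive gap means the model was more truthful on the English items than on their Filipino translations. Any significance testing the authors applied is not covered by this sketch.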
