Ko-PIQA: A Korean Physical Commonsense Reasoning Dataset with Cultural Context
Ko-PIQA addresses English-centric bias in AI datasets and is aimed at researchers and developers working on Korean language models and inclusive commonsense reasoning.
To counter the lack of cultural diversity in physical commonsense reasoning datasets, the authors introduce Ko-PIQA, a Korean dataset with cultural context comprising 441 high-quality question-answer pairs; the seven language models evaluated on it reach accuracies between 59.86% and 83.22%.
Physical commonsense reasoning datasets such as PIQA are predominantly English-centric and lack cultural diversity. We introduce Ko-PIQA, a Korean physical commonsense reasoning dataset that incorporates cultural context. Starting from 3.01 million web-crawled questions, we employed a multi-stage filtering approach using three language models to identify 11,553 PIQA-style questions. Through GPT-4o refinement and human validation, we obtained 441 high-quality question-answer pairs. A key feature of Ko-PIQA is its cultural grounding: 19.7% of questions contain culturally specific elements such as traditional Korean foods (kimchi), clothing (hanbok), and specialized appliances (kimchi refrigerators) that require culturally aware reasoning beyond direct translation. We evaluate seven language models on Ko-PIQA; the best model achieves 83.22% accuracy while the weakest reaches only 59.86%, leaving significant room for improvement. Models struggle particularly with culturally specific scenarios, which highlights the importance of culturally diverse datasets. Ko-PIQA serves as both a benchmark for Korean language models and a foundation for more inclusive commonsense reasoning research. The dataset and code will be publicly available.
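The figures in the abstract can be cross-checked with a little arithmetic. The sketch below is illustrative only: it computes how much of the data survives each filtering stage and back-calculates the item counts implied by the reported percentages. The helper `implied_count` and all variable names are hypothetical, the crawl size is the rounded 3.01 million from the text, and the derived counts are our own back-calculation, not numbers stated by the authors.

```python
# Sanity-check arithmetic on the figures reported in the abstract.
# Assumption: every derived count below is a back-calculation from the
# reported percentages; the source does not state these counts.

TOTAL_PAIRS = 441        # final question-answer pairs (from the abstract)
CRAWLED = 3_010_000      # "3.01 million" web-crawled questions (rounded)
PIQA_STYLE = 11_553      # questions kept after multi-stage filtering


def implied_count(percent: float, total: int) -> int:
    """Nearest whole number of items that a reported percentage implies."""
    return round(percent / 100 * total)


# Fraction of items retained at each stage of the pipeline.
funnel = {
    "crawled -> PIQA-style": PIQA_STYLE / CRAWLED,
    "PIQA-style -> final": TOTAL_PAIRS / PIQA_STYLE,
}

best_correct = implied_count(83.22, TOTAL_PAIRS)   # best model
worst_correct = implied_count(59.86, TOTAL_PAIRS)  # weakest model
cultural = implied_count(19.7, TOTAL_PAIRS)        # culturally specific items

if __name__ == "__main__":
    for stage, rate in funnel.items():
        print(f"{stage}: {rate:.4%} retained")
    print(f"best model:  about {best_correct} of {TOTAL_PAIRS} correct")
    print(f"worst model: about {worst_correct} of {TOTAL_PAIRS} correct")
    print(f"culturally specific: about {cultural} questions")
```

Under these assumptions, each back-calculated count rounds to exactly the percentage given in the abstract, so the reported accuracies and the 19.7% cultural share are mutually consistent with a 441-item test set.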