CL · Jan 10, 2025

Multi-Step Reasoning in Korean and the Emergent Mirage

arXiv:2501.05712v2 · 12 citations · h-index: 5 · Proceedings of the 3rd Workshop on Cross-Cultural Considerations in NLP (C3NLP 2025)

Originality: Synthesis-oriented

AI Analysis

This work addresses culturally specific reasoning in large language models, a challenge relevant to AI researchers, but it is incremental: it builds on prior observations of emergent abilities.

The authors evaluate large language models' ability to perform multi-step reasoning in culturally specific Korean contexts. They find that models trained with fewer than 2e25 FLOPs show near-zero performance, and that even state-of-the-art models score under 50%.

We introduce HRMCR (HAE-RAE Multi-Step Commonsense Reasoning), a benchmark designed to evaluate large language models' ability to perform multi-step reasoning in culturally specific contexts, focusing on Korean. The questions are automatically generated via templates and algorithms, requiring LLMs to integrate Korean cultural knowledge into sequential reasoning steps. Consistent with prior observations on emergent abilities, our experiments reveal that models trained on fewer than \(2 \cdot 10^{25}\) training FLOPs struggle to solve any questions, showing near-zero performance. Beyond this threshold, performance improves sharply. State-of-the-art models (e.g., O1) still score under 50%, underscoring the difficulty of our tasks. Notably, stepwise analysis suggests the observed emergent behavior may stem from compounding errors across multiple steps rather than reflecting a genuinely new capability. We publicly release the benchmark and commit to regularly updating the dataset to prevent contamination.
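The compounding-error explanation in the abstract can be illustrated with a small sketch. It assumes, purely for illustration, that a question is answered correctly only if every reasoning step is correct and that steps succeed independently with the same probability. The function name `end_to_end_accuracy`, the step count of 6, and the per-step accuracies are hypothetical and are not taken from the paper.

```python
def end_to_end_accuracy(per_step_accuracy: float, num_steps: int) -> float:
    """Probability that all steps are correct.

    Assumption (not from the paper): steps are independent and share
    the same per-step accuracy, so the result is per_step_accuracy ** num_steps.
    """
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be in [0, 1]")
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    return per_step_accuracy ** num_steps


if __name__ == "__main__":
    NUM_STEPS = 6  # hypothetical number of reasoning steps per question
    # Per-step accuracy improves smoothly, but the end-to-end score
    # stays near zero until per-step accuracy is already high.
    for p in (0.5, 0.7, 0.9, 0.95, 0.99):
        print(f"per-step {p:.2f} -> end-to-end {end_to_end_accuracy(p, NUM_STEPS):.3f}")
```

Under these assumptions, a per-step accuracy of 0.5 gives an end-to-end accuracy of about 0.016 over six steps, while 0.9 gives about 0.53. Smooth gains at each step can therefore look like a sharp jump in the final score, which is the pattern the stepwise analysis points to.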
