CL | Feb 11

Macaron: Controlled, Human-Written Benchmark for Multilingual and Multicultural Reasoning via Template-Filling

arXiv:2602.10732v1 | h-index: 42 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses the problem of evaluating multilingual and multicultural reasoning, which is relevant to AI researchers. It is an incremental advance: it builds on existing benchmark methodologies and adds a template-based approach.

The authors address the lack of multilingual benchmarks that test reasoning over culturally grounded premises by creating Macaron, a controlled benchmark with 11,862 instances across 20 languages. Their evaluation shows that reasoning-mode models perform strongly, with near-parity between English and local languages, while open-weight models degrade substantially in local languages.

Multilingual benchmarks rarely test reasoning over culturally grounded premises: translated datasets keep English-centric scenarios, while culture-first datasets often lack control over the reasoning required. We propose Macaron, a template-first benchmark that factorizes reasoning type and cultural aspect across question languages. Using 100 language-agnostic templates that cover 7 reasoning types and 22 cultural aspects, native annotators create scenario-aligned English and local-language multiple-choice questions and systematically derived True/False questions. Macaron contains 11,862 instances spanning 20 countries/cultural contexts, 10 scripts, and 20 languages (including low-resource ones like Amharic, Yoruba, Zulu, Kyrgyz, and some Arabic dialects). In zero-shot evaluation of 21 multilingual LLMs, reasoning-mode models achieve the strongest performance and near-parity between English and local languages, while open-weight models degrade substantially in local languages and often approach chance on T/F tasks. Culture-grounded mathematical and counting templates are consistently the hardest. The data can be accessed at https://huggingface.co/datasets/AlaaAhmed2444/Macaron.
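To make the template-filling idea concrete, here is a minimal sketch of how a slot-based template might yield one multiple-choice question and a set of True/False items derived from it. Everything in it is an assumption for illustration: the template wording, the slot names, the options, and the helper names `fill_template` and `derive_true_false` are hypothetical, and the paper's actual templates, annotation process, and T/F derivation rules are not given in the abstract.

```python
from dataclasses import dataclass


@dataclass
class MCQ:
    """A multiple-choice question with exactly one correct option."""
    question: str
    options: list[str]
    answer_index: int


def fill_template(template: str, slots: dict[str, str]) -> str:
    """Fill named slots in a template (hypothetical helper)."""
    return template.format(**slots)


def derive_true_false(mcq: MCQ) -> list[tuple[str, bool]]:
    """Pair the question with each option; only the correct option is True.

    This is one plausible derivation scheme, assumed for illustration,
    not the scheme described by the authors.
    """
    return [
        (f"{mcq.question} Answer: {option}", i == mcq.answer_index)
        for i, option in enumerate(mcq.options)
    ]


# Hypothetical culture-grounded counting template and slot values.
template = (
    "{person} buys {count} {item} at {price} {currency} each. "
    "How much does {person} pay in total?"
)
slots = {
    "person": "Amina",
    "count": "3",
    "item": "injera",
    "price": "20",
    "currency": "birr",
}

question = fill_template(template, slots)
total = int(slots["count"]) * int(slots["price"])  # 3 * 20 = 60

mcq = MCQ(
    question=question,
    options=["40 birr", f"{total} birr", "23 birr", "80 birr"],
    answer_index=1,
)
tf_items = derive_true_false(mcq)

for statement, label in tf_items:
    print(label, "|", statement)
```

In this sketch the same template could be filled with slot values from another cultural context, or written out in another language, while the reasoning step (a multiplication) stays fixed. That is the kind of control over reasoning type that the abstract attributes to the template-first design.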

