CL · Sep 8, 2025

Do LLMs exhibit the same commonsense capabilities across languages?

Ivan Martínez-Murillo, Elena Lloret, Paloma Moreda, Albert Gatt

arXiv:2509.06401v1 · 4.92 citations · h-index: 3 · Has Code

Originality: Synthesis-oriented

AI Analysis

This work highlights limitations in LLMs for multilingual commonsense tasks, which is important for developers and researchers in natural language processing, though it is incremental as it extends an existing dataset.

The paper investigates multilingual commonsense generation by LLMs, finding that performance is best in English and significantly lower in less-resourced languages, with mixed results from contextual support.

This paper explores the multilingual commonsense generation abilities of Large Language Models (LLMs). To facilitate this investigation, we introduce MULTICOM, a novel benchmark that extends the COCOTEROS dataset to four languages: English, Spanish, Dutch, and Valencian. The task involves generating a commonsensical sentence that includes a given triplet of words. We evaluate a range of open-source LLMs, including LLaMA, Qwen, Gemma, EuroLLM, and Salamandra, on this benchmark. Our evaluation combines automatic metrics, LLM-as-a-judge approaches (using Prometheus and JudgeLM), and human annotations. Results consistently show superior performance in English, with significantly lower performance in less-resourced languages. While contextual support yields mixed results, it tends to benefit underrepresented languages. These findings underscore the current limitations of LLMs in multilingual commonsense generation. The dataset is publicly available at https://huggingface.co/datasets/gplsi/MULTICOM.
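The task described in the abstract (produce a sentence that contains a given triplet of words) can be illustrated with a minimal sketch. This is not the authors' evaluation code: the prompt wording and the helper names `build_prompt`, `normalize`, and `triplet_coverage` are assumptions made for illustration, and the coverage check uses exact token matching after diacritic stripping, so it will miss inflected forms.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Lowercase and strip diacritics so 'jardín' matches 'jardin'."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_prompt(triplet, language, context=None):
    """Hypothetical prompt template; the paper's actual prompts may differ."""
    words = ", ".join(triplet)
    prompt = (
        f"Write one commonsensical sentence in {language} "
        f"that uses all of these words: {words}."
    )
    if context:
        # Optional contextual support is prepended to the instruction.
        prompt = f"Context: {context}\n{prompt}"
    return prompt


def triplet_coverage(sentence, triplet):
    """Fraction of triplet words found as whole tokens in the sentence."""
    tokens = set(re.findall(r"\w+", normalize(sentence)))
    hits = sum(1 for word in triplet if normalize(word) in tokens)
    return hits / len(triplet)


if __name__ == "__main__":
    triplet = ("dog", "ball", "park")
    print(build_prompt(triplet, "English"))
    print(triplet_coverage("The dog chased the ball in the park.", triplet))
```

A coverage check of this kind only verifies that the constraint words appear. Whether the sentence is commonsensical still needs the automatic metrics, LLM judges, or human annotation that the abstract mentions.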

