CL | LG | Aug 10, 2025

Grounding Multilingual Multimodal LLMs With Cultural Knowledge

arXiv:2508.07414v2 | 10 citations | h-index: 20 | Has Code | EMNLP
Originality: Incremental advance
AI Analysis

This work addresses the cultural gap in multimodal AI systems, making them more globally inclusive, though it is an incremental improvement achieved through targeted data augmentation.

The paper tackles the underperformance of multimodal large language models on cultural entities and low-resource languages. It builds CulturalGround, a dataset of 22 million culturally rich visual question answering pairs across 42 countries and 39 languages, and trains CulturalPangea, which achieves state-of-the-art performance among open models on culture-focused benchmarks, with an average improvement of 5.0 over prior models.

Multimodal Large Language Models excel in high-resource settings, but often misinterpret long-tail cultural entities and underperform in low-resource languages. To address this gap, we propose a data-centric approach that directly grounds MLLMs in cultural knowledge. Leveraging a large-scale knowledge graph from Wikidata, we collect images that represent culturally significant entities, and generate synthetic multilingual visual question answering data. The resulting dataset, CulturalGround, comprises 22 million high-quality, culturally-rich VQA pairs spanning 42 countries and 39 languages. We train an open-source MLLM, CulturalPangea, on CulturalGround, interleaving standard multilingual instruction-tuning data to preserve general abilities. CulturalPangea achieves state-of-the-art performance among open models on various culture-focused multilingual multimodal benchmarks, outperforming prior models by an average of 5.0 without degrading results on mainstream vision-language tasks. Our findings show that our targeted, culturally grounded approach could substantially narrow the cultural gap in MLLMs and offer a practical path towards globally inclusive multimodal systems.
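The abstract mentions interleaving standard multilingual instruction-tuning data with the cultural VQA data, but it does not say how the two are mixed. The following is a minimal sketch of one plausible way to interleave two example pools, not the authors' procedure. The function name `interleave_examples`, the `cultural_fraction` parameter, the example dictionary fields, and the toy data are all hypothetical and chosen only for illustration.

```python
import random
from typing import Dict, Iterator, Sequence

Example = Dict[str, str]


def interleave_examples(
    cultural: Sequence[Example],
    general: Sequence[Example],
    cultural_fraction: float = 0.5,  # assumed mixing probability, not from the paper
    seed: int = 0,
) -> Iterator[Example]:
    """Yield every example from both pools exactly once, in a mixed order.

    While both pools still have items, the next example comes from the
    cultural pool with probability `cultural_fraction`; once one pool is
    exhausted, the remainder of the other pool is emitted.
    """
    if not 0.0 <= cultural_fraction <= 1.0:
        raise ValueError("cultural_fraction must be in [0, 1]")

    rng = random.Random(seed)
    c, g = list(cultural), list(general)
    rng.shuffle(c)
    rng.shuffle(g)

    ci = gi = 0
    while ci < len(c) or gi < len(g):
        take_cultural = gi >= len(g) or (
            ci < len(c) and rng.random() < cultural_fraction
        )
        if take_cultural:
            yield c[ci]
            ci += 1
        else:
            yield g[gi]
            gi += 1


# Toy data standing in for the two sources (hypothetical content).
cultural_pool = [{"source": "cultural", "question": f"cq{i}"} for i in range(3)]
general_pool = [{"source": "general", "question": f"gq{i}"} for i in range(5)]

mixed = list(interleave_examples(cultural_pool, general_pool, cultural_fraction=0.4))
```

A fixed seed keeps the mixed order reproducible across runs. In a real training setup the fraction would be a tuned hyperparameter, and the pools would be streamed rather than held in memory.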
