CL · May 30, 2025

CaMMT: Benchmarking Culturally Aware Multimodal Machine Translation

Emilio Villa-Cueva, Sholpan Bolatzhanova, Diana Turmakhan, Kareem Elzeky, Henok Biadglign Ademtew, Alham Fikri Aji, Vladimir Araujo, Israel Abebe Azime, Jinheon Baek, Frederico Belcavello, Fermin Cristobal, Jan Christian Blaise Cruz

arXiv:2505.24456v29.63 citations · h-index: 39 · EMNLP

Originality: Incremental advance

AI Analysis

This work is aimed at researchers and developers dealing with cultural nuance in machine translation. It is an incremental advance that builds on existing multimodal translation efforts.

The paper investigates whether images can supply cultural context in multimodal machine translation. It introduces the CaMMT benchmark of over 5,800 triples and finds that visual context generally improves translation quality, especially for Culturally-Specific Items (CSIs), disambiguation, and correct gender marking.

Translating cultural content poses challenges for machine translation systems due to the differences in conceptualizations between cultures, where language alone may fail to convey sufficient context to capture region-specific meanings. In this work, we investigate whether images can act as cultural context in multimodal translation. We introduce CaMMT, a human-curated benchmark of over 5,800 triples of images along with parallel captions in English and regional languages. Using this dataset, we evaluate five Vision Language Models (VLMs) in text-only and text+image settings. Through automatic and human evaluations, we find that visual context generally improves translation quality, especially in handling Culturally-Specific Items (CSIs), disambiguation, and correct gender marking. By releasing CaMMT, our objective is to support broader efforts to build and evaluate multimodal translation systems that are better aligned with cultural nuance and regional variations.
