CL AI LG MM

Dec 11, 2025

MultiScript30k: Leveraging Multilingual Embeddings to Extend Cross Script Parallel Data

Christopher Driggers-Ellis, Detravious Brinkley, Ray Chen, Aashish Dhawan, Daisy Zhe Wang, Christan Grant

arXiv:2512.11074v1

3 citations

Originality: Synthesis-oriented

AI Analysis

This work addresses stalled multimodal machine translation research on languages beyond the four European ones in Multi30k by providing a new dataset, though the contribution is incremental because it builds on existing Multi30k extensions.

The authors address Multi30k's restriction to four European languages by creating MultiScript30k, an extension that machine-translates the English version into five additional targets (Arabic, Spanish, Ukrainian, and Simplified and Traditional Chinese). The result contains over 30,000 sentences, with cosine similarity above 0.8 and symmetric KL divergence below 0.000251 for every target except Traditional Chinese.

Multi30k is frequently cited in the multimodal machine translation (MMT) literature, offering parallel text data for training and fine-tuning deep learning models. However, it is limited to four languages: Czech, English, French, and German. This restriction has led many researchers to focus their investigations on these languages alone. As a result, MMT research on diverse languages has stalled, because the official Multi30k dataset represents only European languages in Latin scripts. Previous efforts to extend Multi30k exist, but the list of supported languages, represented language families, and scripts is still very short. To address these issues, we propose MultiScript30k, a new Multi30k dataset extension for global languages in various scripts, created by translating the English version of Multi30k (Multi30k-En) using NLLB200-3.3B. The dataset consists of over 30,000 sentences and provides translations of all sentences in Multi30k-En into Ar, Es, Uk, Zh_Hans, and Zh_Hant. Similarity analysis shows that the proposed extension consistently achieves cosine similarity greater than 0.8 and symmetric KL divergence less than 0.000251 for all supported languages except Zh_Hant, which is comparable to the previous Multi30k extensions ArEnMulti30k and Multi30k-Uk. COMETKiwi scores give a mixed assessment of MultiScript30k as a translation of Multi30k-En relative to the related work: ArEnMulti30k scores nearly equal to MultiScript30k-Ar, but Multi30k-Uk scores 6.4% higher than MultiScript30k-Uk per split.
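The two similarity measures named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the abstract does not say how embeddings are produced, how distributions are derived from them, or which symmetric KL convention is used. The sketch assumes the Jeffreys form KL(P||Q) + KL(Q||P) (some works halve this sum), assumes a small smoothing constant `eps` to avoid log(0), and uses made-up toy vectors. All function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def normalize(weights, eps=1e-12):
    """Turn non-negative weights into a probability distribution.

    The eps smoothing is an assumption of this sketch, used only to
    keep log(p / q) finite when a weight is zero.
    """
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    smoothed = [w + eps for w in weights]
    total = sum(smoothed)
    return [w / total for w in smoothed]


def kl_divergence(p, q):
    """KL(P || Q) in nats for two strictly positive distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def symmetric_kl(p_weights, q_weights):
    """Symmetric KL, assumed here to be KL(P||Q) + KL(Q||P)."""
    if len(p_weights) != len(q_weights):
        raise ValueError("distributions must have the same length")
    p, q = normalize(p_weights), normalize(q_weights)
    return kl_divergence(p, q) + kl_divergence(q, p)


# Toy inputs (made up for illustration, not data from the paper).
source_vec = [0.12, 0.48, 0.31, 0.09]
target_vec = [0.10, 0.50, 0.30, 0.10]

print("cosine similarity:", cosine_similarity(source_vec, target_vec))
print("symmetric KL:", symmetric_kl(source_vec, target_vec))
```

Higher cosine similarity and lower symmetric KL divergence both indicate that the source and translated representations are closer. That is why the abstract reports a lower bound for the first (0.8) and an upper bound for the second (0.000251).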
