IR · AI · CL · LG · Sep 10, 2021

MURAL: Multimodal, Multitask Retrieval Across Languages

arXiv:2109.05125v1 · 63 citations
Originality: Incremental advance
AI Analysis

This work addresses the challenge of limited image-caption data for under-resourced languages in multimodal AI, representing an incremental advancement over existing methods.

The paper aims to improve cross-modal retrieval, especially for under-resourced languages, by training a dual encoder on translation pairs in addition to image-caption pairs. On the Wikipedia Image-Text dataset, MURAL-base improves zero-shot mean recall by 8.1% on average across eight under-resourced languages.
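As a rough illustration of the metric named above, the sketch below computes a mean recall score. The text here does not define the metric, so the sketch assumes a common convention: Recall@K averaged over K in {1, 5, 10} and over both retrieval directions (image-to-text and text-to-image). The function names and the input layout are hypothetical.

```python
def recall_at_k(queries, k):
    """Fraction of queries whose relevant item appears in the top k results.

    `queries` is a list of (ranked_ids, relevant_id) pairs, where
    `ranked_ids` is ordered from best to worst match.
    """
    hits = sum(1 for ranked_ids, relevant_id in queries
               if relevant_id in ranked_ids[:k])
    return hits / len(queries)


def mean_recall(image_to_text, text_to_image, ks=(1, 5, 10)):
    """Average Recall@K over all K values and both retrieval directions.

    Assumption: this averaging scheme is a common convention, not a
    definition taken from the summary above.
    """
    scores = [recall_at_k(direction, k)
              for direction in (image_to_text, text_to_image)
              for k in ks]
    return sum(scores) / len(scores)


# Toy example with made-up identifiers.
i2t = [(["a", "b", "c"], "a"), (["b", "a", "c"], "a")]
t2i = [(["x", "y"], "y")]
score = mean_recall(i2t, t2i, ks=(1, 2))
```

In this toy case the image-to-text direction scores 0.5 at K=1 and 1.0 at K=2, while the text-to-image direction scores 0.0 and 1.0, so the mean over the four values is 0.625.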

Both image-caption pairs and translation pairs provide the means to learn deep representations of and connections between languages. We use both types of pairs in MURAL (MUltimodal, MUltitask Representations Across Languages), a dual encoder that solves two tasks: 1) image-text matching and 2) translation pair matching. By incorporating billions of translation pairs, MURAL extends ALIGN (Jia et al. PMLR'21)--a state-of-the-art dual encoder learned from 1.8 billion noisy image-text pairs. When using the same encoders, MURAL's performance matches or exceeds ALIGN's cross-modal retrieval performance on well-resourced languages across several datasets. More importantly, it considerably improves performance on under-resourced languages, showing that text-text learning can overcome a paucity of image-caption examples for these languages. On the Wikipedia Image-Text dataset, for example, MURAL-base improves zero-shot mean recall by 8.1% on average for eight under-resourced languages and by 6.8% on average when fine-tuning. We additionally show that MURAL's text representations cluster not only with respect to genealogical connections but also based on areal linguistics, such as the Balkan Sprachbund.
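The two-task setup described in the abstract can be sketched as a weighted sum of two contrastive losses, one over image-caption pairs and one over translation pairs. This is a minimal sketch, not the authors' implementation: the in-batch bidirectional softmax loss, the temperature value, the task weights `w_i2t` and `w_t2t`, and all function names are assumptions made for illustration, and plain Python lists stand in for encoder outputs.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def one_way_loss(left, right, temperature):
    """In-batch softmax loss: row i of `left` should match row i of `right`;
    every other row of `right` in the batch acts as a negative."""
    total = 0.0
    for i, u in enumerate(left):
        logits = [dot(u, v) / temperature for v in right]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[i]
    return total / len(left)


def contrastive_loss(left, right, temperature=0.1):
    """Bidirectional loss over L2-normalized embeddings (assumed form)."""
    left = [l2_normalize(v) for v in left]
    right = [l2_normalize(v) for v in right]
    return (one_way_loss(left, right, temperature)
            + one_way_loss(right, left, temperature))


def multitask_loss(image_emb, caption_emb, source_emb, target_emb,
                   w_i2t=1.0, w_t2t=1.0):
    """Weighted sum of the image-text and translation-pair matching losses.
    The weights are hypothetical hyperparameters."""
    return (w_i2t * contrastive_loss(image_emb, caption_emb)
            + w_t2t * contrastive_loss(source_emb, target_emb))


# Toy batch: matched pairs give a lower loss than mismatched pairs.
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
low = multitask_loss(aligned, aligned, aligned, aligned)
high = multitask_loss(aligned, swapped, aligned, swapped)
```

In this sketch the same text encoder output would feed both terms (captions in the first, source and target sentences in the second), which is how translation pairs could shape the text representations of languages with few image-caption examples.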
