Cross-Lingual Representation Alignment Through Contrastive Image-Caption Tuning
This work addresses the challenge of multilingual alignment for low-resource languages by leveraging image-caption data, which is more accessible than bitexts. The contribution is incremental, however, since it builds on existing contrastive learning methods.
The paper tackles multilingual sentence representation alignment without bitexts, using visual information from image-caption datasets, which are easier to create for low-resource languages. The results show that this approach can implicitly align text representations across languages, incorporate unseen languages post-hoc, and produce representations that are usable for cross-lingual NLU and bitext retrieval.
Multilingual alignment of sentence representations has mostly required bitexts to bridge the gap between languages. We investigate whether visual information can bridge this gap instead. Image caption datasets are very easy to create without requiring multilingual expertise, so this offers a more efficient alternative for low-resource languages. We find that multilingual image-caption alignment can implicitly align the text representations between languages, languages unseen by the encoder in pretraining can be incorporated into this alignment post-hoc, and these aligned representations are usable for cross-lingual Natural Language Understanding (NLU) and bitext retrieval.
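To make the idea of contrastive image-caption alignment concrete, the following is a minimal sketch of a symmetric contrastive (InfoNCE-style) loss of the kind popularized by CLIP. The abstract does not specify the exact objective, so the loss form, the function names (`l2_normalize`, `log_softmax`, `contrastive_loss`), and the temperature value of 0.07 are all illustrative assumptions, not the paper's implementation. The intuition is that captions in different languages describing the same image are pulled toward the same image embedding, which may align them with each other indirectly.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_loss(image_embs, caption_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of (image, caption) pairs.

    Pair k is (image_embs[k], caption_embs[k]); every other caption in the
    batch serves as a negative for image k, and vice versa.
    The temperature default is an assumed value, not taken from the paper.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    caps = [l2_normalize(v) for v in caption_embs]
    n = len(imgs)

    # logits[i][j] = cosine similarity of image i and caption j, scaled.
    logits = [
        [sum(a * b for a, b in zip(img, cap)) / temperature for cap in caps]
        for img in imgs
    ]

    # Image-to-caption direction: each row should peak on its diagonal entry.
    img_to_cap = -sum(log_softmax(logits[k])[k] for k in range(n)) / n

    # Caption-to-image direction: same computation over the columns.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    cap_to_img = -sum(log_softmax(columns[k])[k] for k in range(n)) / n

    return 0.5 * (img_to_cap + cap_to_img)


# Toy batch: two images and captions whose embeddings roughly match them.
images = [[1.0, 0.0], [0.0, 1.0]]
matched_captions = [[0.9, 0.1], [0.1, 0.9]]
mismatched_captions = [[0.1, 0.9], [0.9, 0.1]]

matched = contrastive_loss(images, matched_captions)
mismatched = contrastive_loss(images, mismatched_captions)
```

In this toy batch, the correctly paired captions yield a lower loss than the swapped ones. In a multilingual setting, the caption embeddings would come from a text encoder fed captions in several languages, with the loss computed against embeddings of the shared images.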