CL, CV · Aug 19, 2021

Contrastive Language-Image Pre-training for the Italian Language

arXiv:2108.08688v1 · 39 citations
Originality: Synthesis-oriented
AI Analysis

This work addresses two obstacles for non-English languages, here Italian: limited training data and uneven translation quality. The contribution is incremental, since it applies an existing method to a new language.

The paper adapts the CLIP model to Italian by training it on more than 1.4 million image-text pairs. The resulting model, CLIP-Italian, outperforms the multilingual CLIP model on image retrieval and zero-shot classification.

CLIP (Contrastive Language-Image Pre-training) is a very recent multi-modal model that jointly learns representations of images and texts. The model is trained on a massive amount of English data and shows impressive performance on zero-shot classification tasks. Training the same model on a different language is not trivial, since the data available in other languages might not be sufficient and the model needs high-quality translations of the texts to guarantee good performance. In this paper, we present the first CLIP model for the Italian language (CLIP-Italian), trained on more than 1.4 million image-text pairs. Results show that CLIP-Italian outperforms the multilingual CLIP model on the tasks of image retrieval and zero-shot classification.
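
The zero-shot classification mentioned in the abstract can be illustrated with a minimal sketch. It assumes the image and the candidate Italian captions have already been embedded into a shared vector space. The three-dimensional vectors, the caption strings, the function names, and the temperature of 0.07 are all hypothetical and chosen for illustration. This is not the CLIP-Italian code and it uses no real model.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two embeddings of equal length."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_emb, label_embs, temperature=0.07):
    """Pick the caption whose embedding is closest to the image embedding.

    label_embs maps each candidate caption to its (hypothetical) text
    embedding. The temperature value is an illustrative assumption.
    """
    labels = list(label_embs)
    scores = [
        cosine_similarity(image_emb, label_embs[label]) / temperature
        for label in labels
    ]
    probs = softmax(scores)
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], dict(zip(labels, probs))


# Toy embeddings standing in for the outputs of an image and a text encoder.
image_emb = [0.9, 0.1, 0.0]
label_embs = {
    "una foto di un gatto": [1.0, 0.0, 0.0],
    "una foto di un cane": [0.0, 1.0, 0.0],
    "una foto di una pizza": [0.0, 0.0, 1.0],
}

predicted, probs = zero_shot_classify(image_emb, label_embs)
print(predicted)
```

Image retrieval follows the same pattern with the roles reversed: a single text embedding is scored against many image embeddings, and the images are ranked by similarity.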

Code Implementations: 1 repo