LG | Oct 20, 2023

CAPIVARA: Cost-Efficient Approach for Improving Multilingual CLIP Performance on Low-Resource Languages

arXiv:2310.13683v2 | 136 citations | h-index: 8 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses the limited linguistic diversity of vision-language training data, which holds back models in low-resource languages, and offers a practical solution for researchers and practitioners with constrained computational resources.

The paper aims to improve multilingual CLIP performance on low-resource languages. It introduces CAPIVARA, a cost-efficient framework that combines synthetic caption generation with training-cost optimizations. The framework achieves state-of-the-art results in zero-shot tasks with Portuguese texts and allows fine-tuning on a single GPU in 2 hours.

This work introduces CAPIVARA, a cost-efficient framework designed to enhance the performance of multilingual CLIP models in low-resource languages. While CLIP has excelled in zero-shot vision-language tasks, the resource-intensive nature of model training remains challenging. Many datasets lack linguistic diversity, featuring solely English descriptions for images. CAPIVARA addresses this by augmenting text data using image captioning and machine translation to generate multiple synthetic captions in low-resource languages. We optimize the training pipeline with LiT, LoRA, and gradient checkpointing to alleviate the computational cost. Through extensive experiments, CAPIVARA emerges as state of the art in zero-shot tasks involving images and Portuguese texts. We show the potential for significant improvements in other low-resource languages, achieved by fine-tuning the pre-trained multilingual CLIP using CAPIVARA on a single GPU for 2 hours. Our model and code are available at https://github.com/hiaac-nlp/CAPIVARA.
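
The caption augmentation described in the abstract can be pictured with a minimal sketch. This is not the authors' implementation: the function names, the `captioner` and `translator` callables, the `n_synthetic` parameter, and the choice to sample one caption per training step are all assumptions made for illustration. The stand-in captioner and translator are fakes so that the example runs without any model.

```python
import random
from typing import Callable, List


def build_caption_pool(
    image_id: str,
    english_caption: str,
    captioner: Callable[[str, int], List[str]],
    translator: Callable[[str], str],
    n_synthetic: int = 3,
) -> List[str]:
    """Return target-language captions for one image.

    The pool holds the translated original caption followed by
    translated synthetic captions (hypothetical pipeline, see lead-in).
    """
    english = [english_caption] + captioner(image_id, n_synthetic)
    # Drop exact duplicates while preserving order.
    unique = list(dict.fromkeys(english))
    return [translator(text) for text in unique]


def sample_caption(pool: List[str], rng: random.Random) -> str:
    """Pick one caption from the pool (assumed per-step sampling)."""
    return rng.choice(pool)


# Stand-ins for a real captioning model and a real translation model.
def fake_captioner(image_id: str, n: int) -> List[str]:
    return [f"synthetic caption {i} for {image_id}" for i in range(n)]


def fake_translator(text: str) -> str:
    return f"[pt] {text}"


pool = build_caption_pool(
    "img_001", "a capybara by the river", fake_captioner, fake_translator
)
rng = random.Random(0)
caption_for_step = sample_caption(pool, rng)
```

In a real setup, `captioner` would wrap an image-captioning model and `translator` a machine-translation model; the pool would typically be computed once offline so that fine-tuning itself stays cheap.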

Code Implementations: 1 repo