CL · AI · Feb 10, 2025

Evaluation of Multilingual Image Captioning: How far can we get with CLIP models?

arXiv:2502.06600v2 · 13 citations · h-index: 17 · NAACL
Originality: Incremental advance
AI Analysis

This work addresses the evaluation of multilingual image captioning, which matters for applications that need captions in multiple languages, particularly languages with limited native data.

The paper evaluates CLIPScore variants for multilingual image captioning and reports high correlation with human judgements across languages, with finetuned multilingual models generalizing well. This summary does not quote specific correlation values.

The evaluation of image captions, looking at both linguistic fluency and semantic correspondence to visual contents, has witnessed a significant effort. Still, despite advancements such as the CLIPScore metric, multilingual captioning evaluation has remained relatively unexplored. This work presents several strategies, and extensive experiments, related to evaluating CLIPScore variants in multilingual settings. To address the lack of multilingual test data, we consider two different strategies: (1) using quality-aware machine-translated datasets with human judgements, and (2) re-purposing multilingual datasets that target semantic inference and reasoning. Our results highlight the potential of finetuned multilingual models to generalize across languages and to handle complex linguistic challenges. Tests with machine-translated data show that multilingual CLIPScore models can maintain a high correlation with human judgements across different languages, and additional tests with natively multilingual and multicultural data further attest to the high-quality assessments.
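
To make the evaluation setup in the abstract concrete, here is a minimal sketch of how a reference-free CLIPScore is computed and then compared against human ratings. It assumes the standard CLIPScore definition, w * max(cos(image, caption), 0) with w = 2.5. The toy two-dimensional embeddings, the human ratings, and the choice of Kendall's tau as the correlation measure are illustrative assumptions, not values or methods taken from the paper; in practice the embeddings would come from a (multilingual) CLIP image and text encoder.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def clip_score(image_emb, caption_emb, w=2.5):
    """Reference-free CLIPScore: rescaled cosine similarity, clipped at zero."""
    return w * max(cosine(image_emb, caption_emb), 0.0)


def kendall_tau_a(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total pairs; ties count as neither."""
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        s = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical toy data: one image embedding paired with four caption embeddings.
image = [1.0, 0.0]
captions = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [-1.0, 0.0]]
human_ratings = [4, 3, 2, 1]  # made-up human judgements, higher is better

scores = [clip_score(image, c) for c in captions]
tau = kendall_tau_a(scores, human_ratings)
print(scores, tau)
```

The last two captions both receive a score of zero because negative similarities are clipped, so that pair is tied and lowers tau slightly even though the ranking otherwise matches the human ratings.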

Code Implementations: 1 repo
