CL, LG | Mar 12, 2025

Scaling Laws for Conditional Emergence of Multilingual Image Captioning via Generalization from Translation

arXiv:2503.09443v2 | h-index: 1
AI Analysis

The paper addresses data scarcity in cross-lingual vision-language tasks, a bottleneck for multilingual AI applications. It appears incremental, however, because it builds on existing models and methods.

The paper investigates multilingual generalization in vision-language models, with the goal of zero-shot image captioning in languages seen only during translation training. It finds that captioning emerges when a language prefix is used, and that this emergence follows scaling laws that depend on model multilinguality, model size, and the number of training samples seen.

Cross-lingual, cross-task transfer is challenged by task-specific data scarcity, which becomes more severe as language support grows and is further amplified in vision-language models (VLMs). We investigate multilingual generalization in encoder-decoder transformer VLMs to enable zero-shot image captioning in languages encountered only in the translation task. In this setting, the encoder must learn to generate generalizable, task-aware latent vision representations to instruct the decoder via inserted cross-attention layers. To analyze scaling behavior, we train Florence-2 based and Gemma-2 based models (0.4B to 11.2B parameters) on a synthetic dataset using varying compute budgets. While all languages in the dataset have image-aligned translations, only a subset of them include image captions. Notably, we show that captioning can emerge using a language prefix, even when this language only appears in the translation task. We find that indirect learning of unseen task-language pairs adheres to scaling laws that are governed by the multilinguality of the model, model size, and seen training samples. Finally, we demonstrate that the scaling laws extend to downstream tasks, achieving competitive performance through fine-tuning in multimodal machine translation (Multi30K, CoMMuTE), lexical disambiguation (CoMMuTE), and image captioning (Multi30K, XM3600, COCO Karpathy).
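As a rough illustration of what fitting a scaling law involves, the sketch below fits a single-variable power law, error = a * n^(-b), by least squares in log-log space. The functional form, the helper name `fit_power_law`, and all numbers are illustrative assumptions, not taken from the paper, whose laws relate performance to multilinguality, model size, and seen training samples jointly.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**(-b) by ordinary least squares on (log x, log y).

    Hypothetical helper for illustration; returns the pair (a, b).
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    # Slope and intercept of the best-fit line in log-log space.
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # log y = log a - b * log x, so a = exp(intercept) and b = -slope.
    return math.exp(intercept), -slope


# Synthetic, made-up data generated from a known power law (not paper results).
samples_seen = [1e6, 1e7, 1e8, 1e9]
true_a, true_b = 50.0, 0.25
errors = [true_a * n ** (-true_b) for n in samples_seen]

a, b = fit_power_law(samples_seen, errors)
print(f"fitted a={a:.3f}, b={b:.3f}")
```

Because the synthetic points lie exactly on a power law, the fit recovers the generating coefficients up to floating-point error. Real measurements would be noisy and would likely call for a multi-variable form.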

