CV | Nov 14, 2025

Comprehension of Multilingual Expressions Referring to Target Objects in Visual Inputs

arXiv:2511.11427v1 | h-index: 1
Originality: Incremental advance
AI Analysis

This work addresses the need for multilingual visual grounding systems to support global deployment, representing an incremental advance by extending existing English-centric methods to multiple languages.

The paper tackles the problem of multilingual referring expression comprehension by constructing a unified dataset spanning 10 languages and introducing an attention-anchored neural architecture, achieving 86.9% accuracy at IoU@50 on RefCOCO in multilingual evaluation compared to 91.3% for English-only.
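The "accuracy at IoU@50" figure quoted above can be illustrated with a minimal sketch. It assumes boxes are given as `(x1, y1, x2, y2)` corner coordinates and that a prediction counts as correct when its intersection-over-union with the ground-truth box is at least 0.5. The function names `iou` and `accuracy_at_iou` are hypothetical, and whether the paper's evaluation uses `>=` or a strict `>` at the threshold is not stated in this summary.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlap rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(
    preds: Sequence[Box], targets: Sequence[Box], threshold: float = 0.5
) -> float:
    """Fraction of predictions whose IoU with the target meets the threshold."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    # Assumption: a hit is IoU >= threshold (some toolkits use a strict >).
    hits = sum(1 for p, t in zip(preds, targets) if iou(p, t) >= threshold)
    return hits / len(preds)
```

For example, one exact match and one complete miss out of two expressions would give an accuracy of 0.5 under this definition.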

Referring Expression Comprehension (REC) requires models to localize objects in images based on natural language descriptions. Research in this area remains predominantly English-centric, despite increasing global deployment demands. This work addresses multilingual REC through two main contributions. First, we construct a unified multilingual dataset spanning 10 languages by systematically expanding 12 existing English REC benchmarks through machine translation and context-based translation enhancement. The resulting dataset comprises approximately 8 million multilingual referring expressions across 177,620 images, with 336,882 annotated objects. Second, we introduce an attention-anchored neural architecture that uses multilingual SigLIP2 encoders. Our attention-based approach generates coarse spatial anchors from attention distributions, which are subsequently refined through learned residuals. Experimental evaluation demonstrates competitive performance on standard benchmarks, e.g., achieving 86.9% accuracy at IoU@50 on RefCOCO aggregate multilingual evaluation, compared to an English-only result of 91.3%. Multilingual evaluation shows consistent capabilities across languages, establishing the practical feasibility of multilingual visual grounding systems. The dataset and model are available at https://multilingual.franreno.com.
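The idea of a coarse anchor derived from an attention distribution and then refined by a residual can be sketched as follows. This is only an illustration under stated assumptions, not the paper's architecture: the abstract does not say how the anchor is computed or how residuals are parameterized. Here the anchor is taken from the first and second moments of a 2D attention map, and the residual uses the common centre-offset and log-scale box parameterization. The names `attention_anchor`, `refine`, and the `spread` parameter are hypothetical.

```python
import math
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (cx, cy, w, h), normalized to [0, 1]


def attention_anchor(attn: List[List[float]], spread: float = 2.0) -> Box:
    """Coarse box from a non-negative 2D attention map (assumed moment-based rule)."""
    rows, cols = len(attn), len(attn[0])
    total = sum(sum(row) for row in attn)
    if total <= 0:
        raise ValueError("attention map must have positive mass")
    # Attention-weighted mean of cell centres gives the box centre.
    cx = sum(a / total * (j + 0.5) / cols for row in attn for j, a in enumerate(row))
    cy = sum(a / total * (i + 0.5) / rows for i, row in enumerate(attn) for a in row)
    # Attention-weighted variance gives a rough extent along each axis.
    var_x = sum(
        a / total * ((j + 0.5) / cols - cx) ** 2 for row in attn for j, a in enumerate(row)
    )
    var_y = sum(
        a / total * ((i + 0.5) / rows - cy) ** 2 for i, row in enumerate(attn) for a in row
    )
    # Box size: +/- `spread` standard deviations, at least one cell, at most the image.
    w = min(1.0, max(1.0 / cols, 2.0 * spread * math.sqrt(var_x)))
    h = min(1.0, max(1.0 / rows, 2.0 * spread * math.sqrt(var_y)))
    return cx, cy, w, h


def refine(anchor: Box, residual: Box) -> Box:
    """Apply a residual (dx, dy, dw, dh) to an anchor (assumed parameterization)."""
    cx, cy, w, h = anchor
    dx, dy, dw, dh = residual
    # Centre shifts are scaled by anchor size; sizes are scaled in log space.
    return cx + dx * w, cy + dy * h, w * math.exp(dw), h * math.exp(dh)
```

In a trained model the residual would come from a learned regression head; in this sketch it is simply passed in, so a zero residual returns the anchor unchanged.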

