CV, Feb 25

Global-Local Dual Perception for MLLMs in High-Resolution Text-Rich Image Translation

arXiv:2602.21956v11 citations, h-index: 11
Originality: Highly original
AI Analysis

This paper addresses text image machine translation in real-world scenes with cluttered layouts and diverse fonts, proposing a new method for a known bottleneck.

The paper tackles the problem of translating text embedded in high-resolution text-rich images, where existing methods struggle with text omission and semantic drift. The proposed GLoTran framework improves translation completeness and accuracy over state-of-the-art multimodal large language models.

Text Image Machine Translation (TIMT) aims to translate text embedded in images from a source language into a target language, requiring synergistic integration of visual perception and linguistic understanding. Existing TIMT methods, whether cascaded pipelines or end-to-end multimodal large language models (MLLMs), struggle with high-resolution text-rich images due to cluttered layouts, diverse fonts, and non-textual distractions, resulting in text omission, semantic drift, and contextual inconsistency. To address these challenges, we propose GLoTran, a global-local dual visual perception framework for MLLM-based TIMT. GLoTran integrates a low-resolution global image with multi-scale region-level text image slices under an instruction-guided alignment strategy, conditioning MLLMs to maintain scene-level contextual consistency while faithfully capturing fine-grained textual details. Moreover, to realize this dual-perception paradigm, we construct GLoD, a large-scale text-rich TIMT dataset comprising 510K high-resolution global-local image-text pairs covering diverse real-world scenarios. Extensive experiments demonstrate that GLoTran substantially improves translation completeness and accuracy over state-of-the-art MLLMs, offering a new paradigm for fine-grained TIMT under high-resolution and text-rich conditions.
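
To make the global-local idea in the abstract concrete, here is a minimal geometric sketch of how a low-resolution global view and multi-scale region slices might be derived from one high-resolution image. It is not the authors' implementation: the names `Box`, `global_size`, `expand_box`, and `build_views`, the 448-pixel limit, and the scale factors are all hypothetical, and the text boxes are assumed to come from some external detector that the abstract does not describe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned pixel box (hypothetical helper type)."""
    left: int
    top: int
    right: int
    bottom: int


def global_size(width, height, max_side=448):
    """Size of a downscaled global view that preserves aspect ratio.

    max_side=448 is an assumed value, not taken from the paper.
    """
    scale = min(1.0, max_side / max(width, height))
    return max(1, round(width * scale)), max(1, round(height * scale))


def expand_box(box, factor, width, height):
    """Grow a text box around its centre by `factor`, clipped to the image."""
    cx = (box.left + box.right) / 2
    cy = (box.top + box.bottom) / 2
    half_w = (box.right - box.left) * factor / 2
    half_h = (box.bottom - box.top) * factor / 2
    return Box(
        max(0, int(cx - half_w)),
        max(0, int(cy - half_h)),
        min(width, int(cx + half_w)),
        min(height, int(cy + half_h)),
    )


def build_views(width, height, text_boxes, scales=(1.0, 1.5, 2.0)):
    """Return the global view size plus one crop box per region and scale."""
    return {
        "global": global_size(width, height),
        "slices": [
            [expand_box(b, s, width, height) for s in scales]
            for b in text_boxes
        ],
    }


views = build_views(4000, 3000, [Box(100, 100, 300, 200)])
```

In this sketch the global view carries scene-level context at low cost, while each text region is cropped at several context widths so that fine-grained glyph detail survives at full resolution. How the views are actually paired with instructions and fed to the MLLM is specific to GLoTran and is not modelled here.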
