Single-to-mix Modality Alignment with Multimodal Large Language Model for Document Image Machine Translation
This work addresses generalization challenges in DIMT, with applications such as document processing, but its contribution is incremental because it builds on existing MLLM technology.
The paper tackles Document Image Machine Translation (DIMT), which generalizes poorly because of limited training data and complex visual-textual interactions. It introduces M4Doc, a framework that aligns an image-only encoder with a Multimodal Large Language Model (MLLM) to improve translation quality, and it reports substantial gains in cross-domain generalization and in challenging document image scenarios.
Document Image Machine Translation (DIMT) aims to translate text within document images, facing generalization challenges due to limited training data and the complex interplay between visual and textual information. To address these challenges, we introduce M4Doc, a novel single-to-mix modality alignment framework leveraging Multimodal Large Language Models (MLLMs). M4Doc aligns an image-only encoder with the multimodal representations of an MLLM, pre-trained on large-scale document image datasets. This alignment enables a lightweight DIMT model to learn crucial visual-textual correlations during training. During inference, M4Doc bypasses the MLLM, maintaining computational efficiency while benefiting from its multimodal knowledge. Comprehensive experiments demonstrate substantial improvements in translation quality, especially in cross-domain generalization and challenging document image scenarios.
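The training-time alignment described in the abstract can be illustrated with a minimal sketch. The abstract does not state the alignment objective, so everything here is an assumption made for illustration: the mean-squared-error loss, the linear projection, the weighting term `align_weight`, and all function names are hypothetical and are not taken from the paper. The sketch only shows the general shape of the idea: the MLLM features act as a fixed target during training, and the MLLM is not needed once training is done.

```python
from typing import List

Vector = List[float]


def project(features: Vector, weights: List[Vector]) -> Vector:
    """Linearly map image-encoder features into the MLLM feature space (assumed)."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def alignment_loss(student: Vector, teacher: Vector) -> float:
    """Mean squared error between student and teacher features (assumed objective)."""
    if len(student) != len(teacher):
        raise ValueError("student and teacher features must have the same length")
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)


def training_loss(translation_loss: float, student: Vector, teacher: Vector,
                  align_weight: float = 1.0) -> float:
    """Combine the translation loss with the weighted alignment term (hypothetical)."""
    return translation_loss + align_weight * alignment_loss(student, teacher)


# Toy values standing in for real features.
image_features = [1.0, 2.0]                      # output of the image-only encoder
projection = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical projection matrix
mllm_features = [1.0, 2.0, 2.0]                  # target features from the MLLM

student = project(image_features, projection)
loss = training_loss(0.5, student, mllm_features, align_weight=0.5)
print(student, loss)
```

At inference, only the image-only encoder and the translation decoder would run in this sketch; `mllm_features` and `alignment_loss` are used during training alone, which matches the abstract's statement that M4Doc bypasses the MLLM at inference.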