CV · CL · Jul 23, 2025

Dual-branch Prompting for Multimodal Machine Translation

arXiv:2507.17588v1 · 3 citations
Originality: Incremental advance
AI Analysis

This work addresses robustness and practical applicability issues in multimodal machine translation, though it appears incremental: it builds on existing methods and adds a novel prompting strategy.

The paper tackles two weaknesses of multimodal machine translation: sensitivity to irrelevant visual noise and the need for paired image-text inputs at inference. It proposes a diffusion-based dual-branch prompting framework that uses reconstructed images to filter out distractions, and reports superior translation performance on the Multi30K dataset compared with state-of-the-art approaches.

Multimodal Machine Translation (MMT) typically enhances text-only translation by incorporating aligned visual features. Despite the remarkable progress, state-of-the-art MMT approaches often rely on paired image-text inputs at inference and are sensitive to irrelevant visual noise, which limits their robustness and practical applicability. To address these issues, we propose D2P-MMT, a diffusion-based dual-branch prompting framework for robust vision-guided translation. Specifically, D2P-MMT requires only the source text and a reconstructed image generated by a pre-trained diffusion model, which naturally filters out distracting visual details while preserving semantic cues. During training, the model jointly learns from both authentic and reconstructed images using a dual-branch prompting strategy, encouraging rich cross-modal interactions. To bridge the modality gap and mitigate training-inference discrepancies, we introduce a distributional alignment loss that enforces consistency between the output distributions of the two branches. Extensive experiments on the Multi30K dataset demonstrate that D2P-MMT achieves superior translation performance compared to existing state-of-the-art approaches.
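
The abstract does not give the exact form of the distributional alignment loss, so the following is only a minimal sketch of one plausible reading: a symmetric KL divergence between the next-token distributions of the authentic-image branch and the reconstructed-image branch, added to the two translation losses. The function names, the symmetric-KL choice, and the `weight` parameter are assumptions made for illustration and are not taken from the paper.

```python
import math


def softmax(logits):
    """Convert raw scores to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def alignment_loss(logits_real, logits_recon):
    """Symmetric KL between the two branches' output distributions.

    Assumption: the paper's consistency term could take a different form.
    """
    p = softmax(logits_real)   # branch fed the authentic image
    q = softmax(logits_recon)  # branch fed the diffusion-reconstructed image
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


def total_loss(ce_real, ce_recon, logits_real, logits_recon, weight=1.0):
    """Hypothetical combined objective: both translation losses plus alignment."""
    return ce_real + ce_recon + weight * alignment_loss(logits_real, logits_recon)
```

Under this reading, the alignment term is zero when both branches predict the same distribution and grows as they diverge, which is what would push the reconstructed-image branch (the only one available at inference) toward the behavior of the authentic-image branch.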

