Scene Graph as Pivoting: Inference-time Image-free Unsupervised Multimodal Machine Translation with Visual Scene Hallucination
The work addresses a more realistic setup for multimodal translation, in which images are unavailable at test time. The contribution is incremental: it builds on existing unsupervised methods and adapts them to a new inference scenario.
The paper tackles unsupervised multimodal machine translation without images at inference time. It represents both images and texts as scene graphs and introduces a visual scene hallucination mechanism that generates pseudo visual scene graphs from text. On the Multi30K dataset, the method outperforms the best baseline by a significant BLEU margin and produces translations with better completeness, relevance, and fluency.
In this work, we investigate a more realistic unsupervised multimodal machine translation (UMMT) setup, inference-time image-free UMMT, where the model is trained on pairs of source text and image and tested with only source-text inputs. First, we represent the input images and texts as visual and language scene graphs (SG), whose fine-grained vision-language features ensure a holistic understanding of the semantics. To enable pure-text input during inference, we devise a visual scene hallucination mechanism that dynamically generates a pseudo visual SG from the given textual SG. Several SG-pivoting based learning objectives are introduced for unsupervised translation training. On the benchmark Multi30K data, our SG-based method outperforms the best-performing baseline by a significant BLEU margin on this task and setup, yielding translations with better completeness, relevance and fluency without relying on paired images. Further in-depth analyses reveal how our model advances in the task setting.
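
The difference between the training and inference inputs described above can be illustrated with a small sketch. This is not the paper's implementation: the `SceneGraph` class, `hallucinate_visual_sg`, and `build_inputs` are hypothetical names introduced here, and the hallucination step is replaced by a trivial copy of the textual SG, whereas the abstract describes a mechanism that generates the pseudo visual SG dynamically. The sketch only shows where a pseudo visual SG would stand in for the missing image at test time.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class SceneGraph:
    """Toy scene graph: labelled nodes plus (head, relation, tail) edges."""
    nodes: List[str]
    edges: List[Tuple[int, str, int]] = field(default_factory=list)
    pseudo: bool = False  # True when hallucinated rather than parsed from an image


def hallucinate_visual_sg(text_sg: SceneGraph) -> SceneGraph:
    """Placeholder hallucination (assumption): copy the textual SG, flag it pseudo.

    The mechanism in the paper generates the pseudo visual SG dynamically;
    this copy is only a stand-in to make the data flow concrete.
    """
    return SceneGraph(list(text_sg.nodes), list(text_sg.edges), pseudo=True)


def build_inputs(
    text_sg: SceneGraph, visual_sg: Optional[SceneGraph] = None
) -> Tuple[SceneGraph, SceneGraph]:
    """Return the (textual SG, visual SG) pair for a hypothetical translator.

    Training: a visual SG parsed from the paired image is passed in.
    Inference: no image is available, so a pseudo visual SG is hallucinated.
    """
    if visual_sg is None:
        visual_sg = hallucinate_visual_sg(text_sg)
    return text_sg, visual_sg
```

At training time one would call `build_inputs(text_sg, visual_sg)` with both graphs; at inference time, `build_inputs(text_sg)` alone, so the translator always receives a pair of graphs regardless of whether an image exists.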