Adding Multimodal Capabilities to a Text-only Translation Model
This work addresses overfitting in multimodal machine translation, a problem relevant to both researchers and practitioners. The contribution is incremental in that it builds on existing text-only translation models.
Multimodal machine translation models overfit to the Multi30k dataset and perform poorly on text-only datasets. The paper addresses this by starting from a performant text-only translation model and adding vision-text adapters with gating, which improves performance on both Multi30k and text-only benchmarks.
While most current work in multimodal machine translation (MMT) uses the Multi30k dataset for training and evaluation, we find that the resulting models overfit to Multi30k to an extreme degree. Consequently, these models perform very poorly when evaluated on typical text-only test sets such as the WMT newstest datasets. To perform well on both Multi30k and typical text-only datasets, we use a performant text-only machine translation (MT) model as the starting point of our MMT model. We add vision-text adapter layers connected via gating mechanisms to the MT model, and incrementally transform the MT model into an MMT model by 1) pre-training using vision-based masking of the source text and 2) fine-tuning on Multi30k.
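The role of the gating mechanism can be illustrated with a minimal sketch. It assumes a tanh-scaled residual gate, which is one common form of gating and not necessarily the one used in this work; the function name `gated_adapter`, the scalar `gate`, and the toy vectors are all hypothetical. The point of the sketch is that a gate of zero leaves the text-only model's hidden state untouched, so the adapter can be introduced without disturbing the starting MT model.

```python
import math


def gated_adapter(text_hidden, adapter_output, gate):
    """Blend an adapter's output into the text model's hidden state.

    Assumed form: h_out = h + tanh(gate) * adapter_output.
    With gate == 0.0 the update vanishes and h passes through unchanged.
    """
    scale = math.tanh(gate)
    return [h + scale * a for h, a in zip(text_hidden, adapter_output)]


# Toy values, purely illustrative.
hidden = [0.5, -1.0, 2.0]   # hidden state from a text-only MT layer
update = [0.2, 0.4, -0.6]   # hypothetical vision-text adapter output

unchanged = gated_adapter(hidden, update, gate=0.0)  # closed gate
blended = gated_adapter(hidden, update, gate=1.0)    # partially open gate
```

Under this assumption, a learnable gate that starts closed lets training open it only as far as the visual signal proves useful, which fits the incremental MT-to-MMT transformation described above.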