CL, Mar 5, 2024

The Case for Evaluating Multimodal Translation Models on Text Datasets

Vipin Vijayan, Braeden Bowen, Scott Grigsby, Timothy Anderson, Jeremy Gwinnup

arXiv:2403.03014v1, 2.74 citations, h-index: 8

Originality: Synthesis-oriented

AI Analysis

The paper addresses the need for better evaluation standards in MMT research, so that models are assessed on the capabilities they are meant to have. The contribution is incremental: it improves existing evaluation practice rather than introducing a new model or method.

The paper tackles the inadequate evaluation of multimodal machine translation (MMT) models by proposing an evaluation framework with three parts: CoMMuTE for measuring the use of visual information, the WMT news translation test sets for complex sentences, and Multi30k for real MMT data. It shows that recent MMT models trained on Multi30k suffer a dramatic performance drop on text-only test sets compared to text-only models.

A good evaluation framework should evaluate multimodal machine translation (MMT) models by measuring 1) their use of visual information to aid in the translation task and 2) their ability to translate complex sentences, as is done for text-only machine translation. However, most current work in MMT is evaluated against the Multi30k test sets, which do not measure these properties. Namely, the use of visual information by the MMT model cannot be shown directly from the Multi30k test set results, and the sentences in Multi30k are image captions, i.e., short, descriptive sentences, as opposed to the complex sentences that typical text-only machine translation models are evaluated against. Therefore, we propose that MMT models be evaluated using 1) the CoMMuTE evaluation framework, which measures the use of visual information by MMT models, 2) the text-only WMT news translation task test sets, which evaluate translation performance against complex sentences, and 3) the Multi30k test sets, for measuring MMT model performance against a real MMT dataset. Finally, we evaluate recent MMT models trained solely on the Multi30k dataset against our proposed evaluation framework and demonstrate the dramatic drop in performance on text-only test sets compared to recent text-only MT models.
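The three-part evaluation described in the abstract could be organized along the following lines. This is a minimal bookkeeping sketch, not the authors' code: the component labels, the `FrameworkReport` class, and the `build_report` and `score_gap` helpers are hypothetical names introduced here for illustration, and the scores are assumed to be computed elsewhere with whatever metric the evaluator chooses for each test set.

```python
from dataclasses import dataclass

# Hypothetical labels for the three components named in the abstract.
COMPONENTS = ("commute", "wmt_news", "multi30k")


@dataclass
class FrameworkReport:
    """Scores for one model on all three components (assumed layout)."""
    model_name: str
    scores: dict  # component label -> score, higher is better


def build_report(model_name, scores):
    """Require a score for every component so no part is silently skipped."""
    missing = [c for c in COMPONENTS if c not in scores]
    if missing:
        raise ValueError(f"missing scores for: {', '.join(missing)}")
    return FrameworkReport(model_name, {c: float(scores[c]) for c in COMPONENTS})


def score_gap(mmt, text_only, component="wmt_news"):
    """Text-only score minus MMT score on one component.

    A positive value means the MMT model scores lower than the
    text-only model on that test set.
    """
    return text_only.scores[component] - mmt.scores[component]
```

Requiring all three scores before a report can be built reflects the abstract's argument that Multi30k results alone do not show visual grounding or the ability to translate complex sentences.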
