CL, Aug 5, 2019

Predicting Actions to Help Predict Translations

arXiv:1908.01665v2, 5 citations
AI Analysis

This work addresses the challenge of translating action-related text in multimodal contexts, offering incremental improvements for video-based translation tasks.

The paper investigates whether visual action features extracted from videos can improve text translation on the How2 dataset. It finds that using these features alongside the text increases translation quality, as measured by BLEU scores.

We address the task of text translation on the How2 dataset using a state-of-the-art transformer-based multimodal approach. The question we ask is whether visual features can support the translation process. In particular, given that this dataset is extracted from videos, we focus on the translation of actions, which we believe are poorly captured in the static image-text datasets currently used for multimodal translation. For that purpose, we extract different types of action features from the videos and carefully investigate how helpful this visual information is by testing whether it can increase translation quality when used in conjunction with (i) the original text and (ii) the original text where action-related words (or all verbs) are masked out. The latter is a simulation that helps us assess the utility of the visual input in cases where the text does not provide enough context about the action, or in the presence of noise in the input text.
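
Setting (ii) above can be illustrated with a minimal sketch of verb masking. This is not the authors' preprocessing code: the function name `mask_by_pos`, the `[MASK]` placeholder, the tag names, and the example sentence are all assumptions made for illustration, and the part-of-speech tags are supplied by hand rather than produced by a tagger.

```python
from typing import Iterable, List, Set

# Hypothetical placeholder symbol; the symbol used in the paper is not stated here.
MASK_TOKEN = "[MASK]"


def mask_by_pos(tokens: List[str], pos_tags: List[str],
                tags_to_mask: Iterable[str] = ("VERB",),
                mask_token: str = MASK_TOKEN) -> List[str]:
    """Replace every token whose POS tag is in tags_to_mask with mask_token."""
    if len(tokens) != len(pos_tags):
        raise ValueError("tokens and pos_tags must have the same length")
    targets: Set[str] = set(tags_to_mask)
    return [mask_token if tag in targets else tok
            for tok, tag in zip(tokens, pos_tags)]


# Hand-tagged example sentence (illustrative only, not taken from How2).
tokens = ["she", "whisks", "the", "eggs", "and", "pours", "the", "batter"]
pos_tags = ["PRON", "VERB", "DET", "NOUN", "CCONJ", "VERB", "DET", "NOUN"]

masked = mask_by_pos(tokens, pos_tags)
print(" ".join(masked))  # she [MASK] the eggs and [MASK] the batter
```

With the verbs hidden, the source text no longer says which action takes place, so any recovery of the correct verb in the translation would have to draw on the visual features. Masking only action-related words rather than all verbs would amount to passing a narrower selection criterion in place of the tag set.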
