Keyframe Segmentation and Positional Encoding for Video-guided Machine Translation Challenge 2020
This work addresses the problem of generating high-quality translations by integrating video and text for multimodal machine translation, though it appears incremental as it builds on existing challenge frameworks.
The authors tackled the Video-guided Machine Translation Challenge 2020 by developing a system that uses keyframe segmentation and positional encoding for video features, achieving a corpus-level BLEU-4 score of 36.60 and securing first place in the challenge.
Video-guided machine translation is a multimodal neural machine translation task that aims to generate high-quality text translations by jointly exploiting video and text. In this work, we present our video-guided machine translation system for the Video-guided Machine Translation Challenge 2020. The system employs keyframe-based video feature extraction together with positional encoding of the video features. In the evaluation phase, our system scored 36.60 corpus-level BLEU-4 and achieved 1st place in the Video-guided Machine Translation Challenge 2020.
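To illustrate what positional encoding of video features can look like, the following is a minimal sketch, not the system's actual implementation. It assumes the standard sinusoidal encoding used in Transformer models and assumes each keyframe is already represented by a fixed-length feature vector. The function names `sinusoidal_encoding` and `add_positional_encoding` are hypothetical, and the abstract does not state which encoding variant the system uses.

```python
import math


def sinusoidal_encoding(position, dim):
    """Return a sinusoidal positional encoding vector for one position.

    Assumption: the standard Transformer-style formulation, where each
    pair of dimensions (2k, 2k+1) shares one frequency.
    """
    enc = []
    for i in range(dim):
        angle = position / (10000 ** ((2 * (i // 2)) / dim))
        # Even dimensions use sine, odd dimensions use cosine.
        enc.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return enc


def add_positional_encoding(keyframe_features):
    """Add a position-dependent vector to each keyframe feature vector.

    keyframe_features: list of equal-length lists of floats, ordered by
    the keyframes' temporal order in the video (hypothetical input format).
    """
    encoded = []
    for pos, feat in enumerate(keyframe_features):
        pe = sinusoidal_encoding(pos, len(feat))
        encoded.append([f + p for f, p in zip(feat, pe)])
    return encoded


# Usage: two dummy keyframes with 4-dimensional zero features.
features = [[0.0] * 4, [0.0] * 4]
encoded = add_positional_encoding(features)
```

With zero-valued features, the output is the encoding itself, so the two keyframes become distinguishable purely by their temporal order. This is the property that lets a downstream attention model use the order of the keyframes.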