CV, AI, CL | Mar 17, 2022

DU-VLG: Unifying Vision-and-Language Generation via Dual Sequence-to-Sequence Pre-training

arXiv:2203.09052v1641 citations | h-index: 24
Originality: Highly original
AI Analysis

This work addresses bi-directional generation in vision-and-language tasks. It is incremental in that it builds on existing models, to which it adds novel pre-training objectives.

The paper proposes DU-VLG, a framework that unifies vision-and-language generation tasks via dual sequence-to-sequence pre-training. It outperforms previous state-of-the-art systems on three tasks, and human judges confirm that it generates real, relevant images and faithful, informative captions.

Due to the limitations of the model structure and pre-training objectives, existing vision-and-language generation models cannot utilize pair-wise images and text through bi-directional generation. In this paper, we propose DU-VLG, a framework which unifies vision-and-language generation as sequence generation problems. DU-VLG is trained with novel dual pre-training tasks: multi-modal denoising autoencoder tasks and modality translation tasks. To bridge the gap between image understanding and generation, we further design a novel commitment loss. We compare pre-training objectives on image captioning and text-to-image generation datasets. Results show that DU-VLG yields better performance than variants trained with uni-directional generation objectives or the variant without the commitment loss. We also obtain higher scores compared to previous state-of-the-art systems on three vision-and-language generation tasks. In addition, human judges further confirm that our model generates real and relevant images as well as faithful and informative captions.

