DU-VLG: Unifying Vision-and-Language Generation via Dual Sequence-to-Sequence Pre-training
This work addresses bi-directional generation in vision-and-language tasks. The contribution is incremental: it builds on existing models and adds novel pre-training objectives.
The paper proposes DU-VLG, a framework that unifies vision-and-language generation via dual sequence-to-sequence pre-training. DU-VLG outperforms previous state-of-the-art systems on three vision-and-language generation tasks, and human judges confirm that it generates real, relevant images and faithful, informative captions.
Due to the limitations of the model structure and pre-training objectives, existing vision-and-language generation models cannot utilize pair-wise images and text through bi-directional generation. In this paper, we propose DU-VLG, a framework which unifies vision-and-language generation as sequence generation problems. DU-VLG is trained with novel dual pre-training tasks: multi-modal denoising autoencoder tasks and modality translation tasks. To bridge the gap between image understanding and generation, we further design a novel commitment loss. We compare pre-training objectives on image captioning and text-to-image generation datasets. Results show that DU-VLG yields better performance than variants trained with uni-directional generation objectives or the variant without the commitment loss. We also obtain higher scores compared to previous state-of-the-art systems on three vision-and-language generation tasks. In addition, human judges further confirm that our model generates real and relevant images as well as faithful and informative captions.
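To make the dual pre-training setup more concrete, the sketch below shows one plausible way to turn a single image-text pair into source/target sequences for the two task families named in the abstract: multi-modal denoising autoencoding and modality translation. It is an illustration only, not the authors' implementation. It assumes images are already encoded as discrete tokens, and the mask token, the masking ratio, the task names, and the functions `mask_tokens` and `build_dual_pairs` are all hypothetical. The commitment loss is not covered, since the abstract does not define it.

```python
import random
from typing import Dict, List, Tuple

MASK = "<mask>"  # hypothetical mask token, not taken from the paper

Pair = Tuple[List[str], List[str]]  # (source sequence, target sequence)


def mask_tokens(tokens: List[str], ratio: float, rng: random.Random) -> List[str]:
    """Return a copy of tokens with a random subset replaced by MASK."""
    if not tokens or ratio <= 0:
        return list(tokens)
    # Mask at least one token, and never more than the sequence length.
    n = min(len(tokens), max(1, round(len(tokens) * ratio)))
    masked = set(rng.sample(range(len(tokens)), n))
    return [MASK if i in masked else t for i, t in enumerate(tokens)]


def build_dual_pairs(
    image_tokens: List[str],
    text_tokens: List[str],
    ratio: float = 0.5,  # assumed masking ratio
    seed: int = 0,
) -> Dict[str, Pair]:
    """Build four sequence-to-sequence examples from one image-text pair."""
    rng = random.Random(seed)
    return {
        # Denoising: recover the clean text given the image and corrupted text.
        "denoise_text": (image_tokens + mask_tokens(text_tokens, ratio, rng), text_tokens),
        # Denoising: recover the clean image given the text and corrupted image.
        "denoise_image": (text_tokens + mask_tokens(image_tokens, ratio, rng), image_tokens),
        # Modality translation in both directions.
        "image_to_text": (image_tokens, text_tokens),
        "text_to_image": (text_tokens, image_tokens),
    }


if __name__ == "__main__":
    image = ["img_1", "img_2", "img_3", "img_4"]
    text = ["a", "dog", "runs", "fast"]
    for task, (source, target) in build_dual_pairs(image, text).items():
        print(task, source, "->", target)
```

Each pair contributes training examples in both directions, which is the sense in which the objectives are "dual": the same model sees image-to-text and text-to-image targets built from the same data.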