CL · AI · CV · LG · Feb 22, 2022

VU-BERT: A Unified framework for Visual Dialog

arXiv:2202.10787v1 · 1 citation
Originality: Incremental advance
AI Analysis

This work aims to simplify how multi-modal interactions are modeled in visual dialog, though the advance appears incremental.

The paper tackles the visual dialog task by proposing VU-BERT, a unified framework for image-text joint embedding, achieving a competitive NDCG score of 0.7287 on the VisDial v1.0 dataset.
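For readers unfamiliar with the metric, the following is a minimal sketch of a generic NDCG@k computation over a ranked list of candidate answers. It is not the official VisDial evaluation code: the function names `dcg` and `ndcg_at_k`, the choice of `k`, and the toy scores and relevance values are all illustrative assumptions.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: each gain is discounted by log2(rank + 1)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg_at_k(scores, relevances, k):
    """NDCG@k for one question (hypothetical helper, not the VisDial evaluator).

    scores:     model score for each candidate answer
    relevances: ground-truth relevance for each candidate answer
    """
    # Rank candidates by model score, highest first.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [relevances[i] for i in order[:k]]
    # The ideal ranking sorts candidates by their true relevance.
    ideal = sorted(relevances, reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy data (made up): a perfect ranking and a fully inverted one.
perfect = ndcg_at_k([0.9, 0.1, 0.5], [1.0, 0.0, 0.5], k=3)
inverted = ndcg_at_k([0.1, 0.9], [1.0, 0.0], k=2)
```

A score of 1.0 means the model ordered the candidates exactly as the relevance labels would; lower values mean relevant answers were ranked further down.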

The visual dialog task trains an agent to answer multi-turn questions about a given image, which requires a deep understanding of the interactions between the image and the dialog history. Existing work tends to model these interactions with modality-specific modules, which can be cumbersome to use. To fill this gap, we propose VU-BERT, a unified framework for image-text joint embedding, and are the first to apply patch projection to obtain vision embeddings in visual dialog, which simplifies the model. The model is trained on two tasks: masked language modeling and next utterance retrieval. These tasks help it learn visual concepts, dependencies between utterances, and the relationships between the two modalities. Finally, VU-BERT achieves competitive performance (an NDCG score of 0.7287) on the VisDial v1.0 dataset.
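The patch projection mentioned in the abstract can be sketched roughly as follows, assuming the common ViT-style formulation: the image is cut into non-overlapping square patches, each patch is flattened, and a single linear map turns it into an embedding vector. The abstract does not give the patch size, embedding dimension, or any position or bias terms, so the function name `patch_projection`, the shapes, and the all-ones toy inputs below are illustrative assumptions only.

```python
def patch_projection(image, patch_size, weight):
    """Project non-overlapping image patches to embedding vectors.

    image:  H x W x C nested lists of pixel values
    weight: (patch_size * patch_size * C) x D nested lists (assumed linear map)
    Returns one D-dimensional embedding per patch, in row-major patch order.
    """
    height, width = len(image), len(image[0])
    assert height % patch_size == 0 and width % patch_size == 0
    embeddings = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten the patch into one vector of length patch_size^2 * C.
            flat = [
                value
                for row in image[top:top + patch_size]
                for pixel in row[left:left + patch_size]
                for value in pixel
            ]
            # Linear projection: dot the flat patch with each weight column.
            embeddings.append([
                sum(x * w for x, w in zip(flat, column))
                for column in zip(*weight)
            ])
    return embeddings


# Toy example (made-up sizes): a 4x4 RGB image, 2x2 patches, 8-dim embeddings.
toy_image = [[[1.0, 1.0, 1.0] for _ in range(4)] for _ in range(4)]
toy_weight = [[1.0] * 8 for _ in range(2 * 2 * 3)]
patch_embeddings = patch_projection(toy_image, patch_size=2, weight=toy_weight)
```

In a unified model of this kind, the resulting patch embeddings would typically be concatenated with the text token embeddings and fed to one shared transformer, replacing a separate object detector or CNN backbone.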

