CL · CV · May 23, 2023

Weakly-Supervised Learning of Visual Relations in Multimodal Pretraining

arXiv:2305.14281v2 · 134 citations
Originality: Incremental advance
AI Analysis

This work addresses the challenge of enhancing multimodal AI models for tasks requiring detailed visual understanding, though it is incremental as it builds on existing pretraining baselines.

The paper tackles the problem of learning fine-grained multimodal representations in vision-and-language pretraining. It incorporates weakly-supervised visual relation data into pretraining and reports improved zero-shot performance on both coarse-grained and fine-grained tasks.

Recent work in vision-and-language pretraining has investigated supervised signals from object detection data to learn better, fine-grained multimodal representations. In this work, we take a step further and explore how we can tap into supervision from small-scale visual relation data. In particular, we propose two pretraining approaches to contextualise visual entities in a multimodal setup. With verbalised scene graphs, we transform visual relation triplets into structured captions, and treat them as additional image descriptions. With masked relation prediction, we further encourage relating entities from image regions with visually masked contexts. When applied to strong baselines pretrained on large amounts of Web data, zero-shot evaluations on both coarse-grained and fine-grained tasks show the efficacy of our methods in learning multimodal representations from weakly-supervised relations data.
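
To make the verbalised scene graph idea concrete, the sketch below turns visual relation triplets into a single caption-like string that could be treated as an additional image description. It is a minimal illustration, not the authors' implementation: the function name `verbalise_scene_graph`, the `(subject, predicate, object)` tuple layout, and the `". "` separator are all assumptions, since the abstract does not specify the exact caption template.

```python
from typing import List, Tuple

# Assumed layout for one visual relation: (subject, predicate, object).
Triplet = Tuple[str, str, str]


def verbalise_scene_graph(triplets: List[Triplet], sep: str = ". ") -> str:
    """Join relation triplets into one structured, caption-like string.

    Hypothetical template: each triplet becomes "subject predicate object",
    and the resulting phrases are joined with `sep`.
    """
    phrases = [" ".join(part.strip() for part in triplet) for triplet in triplets]
    return sep.join(phrases)


# Example scene graph with two relations (made-up annotations).
triplets = [("man", "riding", "horse"), ("horse", "on", "beach")]
caption = verbalise_scene_graph(triplets)
print(caption)  # man riding horse. horse on beach
```

The resulting string can be paired with the image in the same way as a regular caption. Triplet ordering and the choice of separator are design decisions this sketch leaves open.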

Code Implementations: 1 repo
