CV · LG · Jul 12, 2022

IDEA: Increasing Text Diversity via Online Multi-Label Recognition for Vision-Language Pre-training

arXiv:2207.05333v2 · 15 citations · h-index: 54
Originality: Highly original
AI Analysis

IDEA addresses a limitation of existing VLP methods, which rely on time-consuming object detectors restricted to predefined categories, and offers a more efficient and flexible way to improve vision-language models.

The paper tackles suboptimal image-text alignment in Vision-Language Pre-training (VLP) by introducing IDEA, which uses online multi-label recognition with image tags extracted from the paired texts, improving performance on multiple downstream datasets at a small extra computational cost.

Vision-Language Pre-training (VLP) with large-scale image-text pairs has demonstrated superior performance in various fields. However, the image-text pairs that co-occur on the Internet typically lack explicit alignment information, which is suboptimal for VLP. Existing methods adopt an off-the-shelf object detector to utilize additional image tag information. However, the object detector is time-consuming and can only identify pre-defined object categories, limiting the model capacity. Inspired by the observation that the texts incorporate incomplete fine-grained image information, we introduce IDEA, which stands for increasing text diversity via online multi-label recognition for VLP. IDEA shows that multi-label learning with image tags extracted from the texts can be jointly optimized during VLP. Moreover, IDEA can identify valuable image tags online to provide more explicit textual supervision. Comprehensive experiments demonstrate that IDEA can significantly boost performance on multiple downstream datasets with a small extra computational cost.
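
The idea of deriving multi-label targets from captions and surfacing extra tags online can be illustrated with a minimal sketch. This is not the authors' implementation: the tag vocabulary, the whitespace tokenization, the 0.9 confidence threshold, and the function names are all assumptions made for illustration, and the logits are made-up numbers standing in for an image classifier's output.

```python
import math

# Hypothetical tag vocabulary; the paper's actual vocabulary is not given here.
TAG_VOCAB = ["dog", "frisbee", "grass", "person", "beach"]


def tags_from_caption(caption, vocab=TAG_VOCAB):
    """Return a multi-hot target: 1.0 where a vocabulary tag appears in the caption."""
    words = {w.strip(".,!?").lower() for w in caption.split()}
    return [1.0 if tag in words else 0.0 for tag in vocab]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def multilabel_bce(logits, targets):
    """Mean binary cross-entropy over all tags, a common multi-label loss."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for z, t in zip(logits, targets):
        p = sigmoid(z)
        total -= t * math.log(p + eps) + (1.0 - t) * math.log(1.0 - p + eps)
    return total / len(logits)


def online_tags(logits, targets, vocab=TAG_VOCAB, threshold=0.9):
    """Tags predicted with high confidence that the caption did not mention.

    The threshold is an assumed value, not one taken from the paper.
    """
    return [tag for tag, z, t in zip(vocab, logits, targets)
            if t == 0.0 and sigmoid(z) >= threshold]


caption = "A dog catches a frisbee."
targets = tags_from_caption(caption)   # tags found in the text
logits = [4.0, 3.0, 5.0, -4.0, -6.0]   # made-up classifier outputs for one image
loss = multilabel_bce(logits, targets)
extra = online_tags(logits, targets)   # confident tags missing from the text
```

In this toy case the caption supplies "dog" and "frisbee" as targets, while "grass" is recovered online because the classifier is confident about it even though the text never mentions it. In a joint training setup, a loss like this would presumably be added to the usual pre-training objectives with some weighting, which the sketch does not attempt to model.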

Code Implementations: 1 repo