CV · CL · Mar 9, 2023

Refined Vision-Language Modeling for Fine-grained Multi-modal Pre-training

arXiv:2303.05313v2 · 1 citation · h-index: 42
AI Analysis

This paper addresses the challenge of obtaining fine-grained supervision for multi-modal pre-training in vision-language applications, offering a more efficient alternative to annotation-heavy methods.

The paper tackles the problem of fine-grained vision-language pre-training without costly object annotations by introducing a homonym sentence rewriting algorithm and a refined vision-language modeling framework, achieving superior performance on several downstream tasks.

Fine-grained supervision based on object annotations has been widely used for vision and language pre-training (VLP). However, in real-world application scenarios, aligned multi-modal data usually comes in the image-caption format, which provides only coarse-grained supervision. Collecting object annotations and building object annotation pre-extractors for different scenarios is both costly and computationally expensive. In this paper, we propose a fine-grained VLP scheme that works without object annotations by approaching the problem from the linguistic perspective. First, we propose a homonym sentence rewriting (HSR) algorithm to provide token-level supervision. The algorithm replaces a verb, noun, adjective, or quantifier in the caption with one of its homonyms from WordNet. Correspondingly, we propose a refined vision-language modeling (RVLM) framework to exploit the token-level supervision. Three refined tasks, i.e., refined image-text contrastive (RITC), refined image-text matching (RITM), and replace language modeling (RLM), are proposed to learn the fine-grained alignment. Extensive experiments on several downstream tasks demonstrate the superior performance of the proposed method.
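The rewriting step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the small `TOY_LEXICON` dictionary is a hypothetical stand-in for the WordNet lookup, the function name `rewrite_caption` is invented for illustration, and the choice to replace exactly one word per caption is an assumption. The sketch only shows how replacing a word yields a rewritten caption together with per-token labels that mark which position changed, which is one plausible form of token-level supervision.

```python
import random
from typing import Dict, List, Optional, Tuple

# Hypothetical stand-in for a WordNet lookup: maps a word to candidate
# replacement words. The entries are illustrative only.
TOY_LEXICON: Dict[str, List[str]] = {
    "dog": ["cat", "horse"],      # noun
    "red": ["blue", "green"],     # adjective
    "two": ["three", "four"],     # quantifier
    "runs": ["sits", "sleeps"],   # verb
}


def rewrite_caption(
    caption: str,
    lexicon: Dict[str, List[str]],
    rng: Optional[random.Random] = None,
) -> Tuple[List[str], List[int]]:
    """Replace one replaceable word and return (tokens, token_labels).

    token_labels[i] is 1 if token i was replaced, otherwise 0.
    If no token appears in the lexicon, the caption is returned unchanged
    with all-zero labels.
    """
    rng = rng or random.Random()
    tokens = caption.lower().split()
    labels = [0] * len(tokens)

    # Positions whose word has at least one candidate replacement.
    candidates = [i for i, tok in enumerate(tokens) if tok in lexicon]
    if not candidates:
        return tokens, labels

    # Assumption: rewrite exactly one word per caption.
    idx = rng.choice(candidates)
    tokens[idx] = rng.choice(lexicon[tokens[idx]])
    labels[idx] = 1
    return tokens, labels


if __name__ == "__main__":
    new_tokens, new_labels = rewrite_caption(
        "a red dog runs", TOY_LEXICON, random.Random(0)
    )
    print(" ".join(new_tokens))
    print(new_labels)
```

In a setup like this, the rewritten caption could serve as a hard negative for the image, while the label vector tells a token-level objective which word no longer matches it. How RITC, RITM, and RLM actually consume this signal is not specified in the abstract.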

