CV | May 21, 2024

C3L: Content Correlated Vision-Language Instruction Tuning Data Generation via Contrastive Learning

arXiv:2405.12752v2 | 1 citation | h-index: 18 | Has Code | IJCAI
Originality: Incremental advance
AI Analysis

This work addresses a bottleneck for researchers training large vision-language models, but the advance is incremental because it builds on existing data generation methods.

The paper targets two problems in generating vision-language instruction tuning data: low content relevance between the generated data and the images, and exposure bias. It proposes C3L, which combines a content relevance module with contrastive learning to improve data quality, and reports its effectiveness using automatic measures on four benchmarks.

Vision-Language Instruction Tuning (VLIT) is a critical training phase for Large Vision-Language Models (LVLMs). With the improving capabilities of open-source LVLMs, researchers have increasingly turned to generating VLIT data with open-source LVLMs and have achieved significant progress. However, such data generation approaches are bottlenecked by the following challenges: 1) Since multi-modal models tend to be influenced by prior language knowledge, directly using LVLMs to generate VLIT data inevitably leads to low content relevance between the generated data and the images. 2) To improve the ability of the models to generate VLIT data, previous methods have incorporated an additional training phase to boost generative capacity. This process hurts the generalization of the models to unseen inputs (i.e., the "exposure bias" problem). In this paper, we propose a new Content Correlated VLIT data generation method via Contrastive Learning (C3L). Specifically, we design a new content relevance module that enhances the content relevance between VLIT data and images by computing Image Instruction Correspondence Scores S(I2C). Moreover, a contrastive learning module is introduced to further boost the VLIT data generation capability of the LVLMs. Extensive automatic measures on four benchmarks show the effectiveness of our method.
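The abstract names a contrastive learning module but does not state its objective. As a rough illustration of what a contrastive objective does in general, here is a minimal sketch of a generic InfoNCE-style loss for one image scored against several candidate instructions. This is an assumption for illustration only and is not the C3L loss: the function name `info_nce_loss`, the `temperature` parameter, and the idea of feeding correspondence scores in as similarities are all hypothetical.

```python
import math


def info_nce_loss(scores, positive_index, temperature=1.0):
    """Generic InfoNCE loss for a single anchor (illustrative, not the C3L objective).

    scores: similarity scores between one image and each candidate instruction
            (hypothetical stand-ins for image-instruction correspondence scores).
    positive_index: index of the candidate treated as the matching instruction.
    temperature: positive scaling factor; smaller values sharpen the softmax.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if not 0 <= positive_index < len(scores):
        raise IndexError("positive_index out of range")

    scaled = [s / temperature for s in scores]

    # Numerically stable log-sum-exp over all candidates.
    max_scaled = max(scaled)
    log_sum_exp = max_scaled + math.log(sum(math.exp(s - max_scaled) for s in scaled))

    # Negative log-probability of the positive under a softmax over candidates.
    return log_sum_exp - scaled[positive_index]


if __name__ == "__main__":
    # Hypothetical scores: the first instruction matches the image best.
    well_matched = info_nce_loss([2.0, 0.1, -0.3], positive_index=0)
    poorly_matched = info_nce_loss([2.0, 0.1, -0.3], positive_index=2)
    print(well_matched, poorly_matched)
```

The loss is small when the positive candidate scores well above the others and grows as negatives catch up, which is the general mechanism by which a contrastive objective pushes a generator toward content that matches its image.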

