LG · CL · CV · Dec 14, 2023

TiMix: Text-aware Image Mixing for Effective Vision-Language Pre-training

arXiv:2312.08846v4 · 6 citations · h-index: 28 · AAAI
Originality: Incremental advance
AI Analysis

This work addresses data efficiency for vision-language pre-training models, offering a computationally viable solution that could benefit broader adoption in practical scenarios, though it is incremental as it builds on existing mix-based augmentation techniques.

The paper tackles data inefficiency and high computational cost in scaling up vision-language pre-training (VLP). It proposes TiMix, a text-aware image mixing method that integrates mix-based data augmentation into self-supervised multi-modal contrastive learning. TiMix achieves comparable performance on downstream tasks with less training data and shorter training time.

Self-supervised Multi-modal Contrastive Learning (SMCL) remarkably advances modern Vision-Language Pre-training (VLP) models by aligning visual and linguistic modalities. Due to noise in web-harvested text-image pairs, however, scaling up training data volume in SMCL presents considerable obstacles in terms of computational cost and data inefficiency. To improve data efficiency in VLP, we propose Text-aware Image Mixing (TiMix), which integrates mix-based data augmentation techniques into SMCL, yielding significant performance improvements without significantly increasing computational overhead. We provide a theoretical analysis of TiMix from a mutual information (MI) perspective, showing that mixed data samples for cross-modal contrastive learning implicitly serve as a regularizer for the contrastive loss. The experimental results demonstrate that TiMix achieves comparable performance on downstream tasks, even with a reduced amount of training data and shorter training time, when benchmarked against existing methods. This work empirically and theoretically demonstrates the potential of data mixing for data-efficient and computationally viable VLP, benefiting broader VLP model adoption in practical scenarios.
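To make the idea of mixing data inside a cross-modal contrastive loss concrete, here is a minimal sketch. It is not the TiMix algorithm: the abstract does not specify how the mixing is made text-aware, so this example assumes plain Mixup-style pixel blending and a soft-target image-to-text InfoNCE loss. The names `mix_images`, `mixed_contrastive_loss`, `partner`, and `lam` are hypothetical and chosen only for illustration.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over one row of similarity logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def mix_images(img_a, img_b, lam):
    """Assumed Mixup-style blend of two images given as flat pixel lists.

    lam is the weight of img_a; (1 - lam) is the weight of img_b.
    """
    return [lam * a + (1.0 - lam) * b for a, b in zip(img_a, img_b)]


def mixed_contrastive_loss(logits, partner, lam):
    """Image-to-text InfoNCE with soft targets for mixed images.

    logits[i][k]: similarity between mixed image i and text k.
    partner[i]:   index of the second image blended into image i.
    lam:          weight of the original image i in the blend.

    Mixed image i is treated as matching text i with weight lam and
    text partner[i] with weight (1 - lam). With lam = 1 this reduces
    to the usual one-hot InfoNCE loss.
    """
    total = 0.0
    for i, row in enumerate(logits):
        lp = log_softmax(row)
        total -= lam * lp[i] + (1.0 - lam) * lp[partner[i]]
    return total / len(logits)


if __name__ == "__main__":
    # Toy batch of two images; each is blended with the other one.
    blended = mix_images([1.0, 0.0], [0.0, 1.0], lam=0.25)
    loss = mixed_contrastive_loss([[2.0, 0.5], [0.3, 1.5]], partner=[1, 0], lam=0.7)
    print(blended, loss)
```

The soft targets are what tie the mixing ratio to the loss: the more of the second image that is blended in, the more its caption counts as a positive.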

Code Implementations: 1 repo