CV, AI | Jan 11, 2024

Efficient Vision-and-Language Pre-training with Text-Relevant Image Patch Selection

Wei Ye, Chaoya Jiang, Haiyang Xu, Chenhao Ye, Chenliang Li, Ming Yan, Shikun Zhang, Songhang Huang, Fei Huang

arXiv:2403.07883v13.71 citations | h-index: 28

Originality: Incremental advance

AI Analysis

This work addresses efficiency issues for researchers and practitioners who use large-scale VLP models, though the advance is incremental because it builds on existing ViT-based approaches.

The paper tackles computational inefficiency in Vision-and-Language Pre-training (VLP) models by introducing TRIPS, a method that selects text-relevant image patches to shorten the visual sequence. The authors report a 40% speedup with competitive or superior performance on downstream tasks.

Vision Transformers (ViTs) have become increasingly popular in large-scale Vision and Language Pre-training (VLP) models. Although previous VLP research has demonstrated the efficacy of ViTs, these efforts still struggle with computational inefficiencies caused by lengthy visual sequences. To address this challenge, we introduce an efficient VLP approach called TRIPS, which stands for Text-Relevant Image Patch Selection. TRIPS progressively reduces the visual sequence using a text-guided patch-selection layer in the visual backbone, thereby accelerating both training and inference processes. This patch-selection layer dynamically computes text-dependent visual attention, enabling it to identify attentive image tokens with text guidance and fuse inattentive ones in an end-to-end fashion. Importantly, TRIPS does not add any extra parameters and generalizes to most ViT-based VLP models. We incorporate TRIPS into three representative VLP models covering single-stream, dual-stream, and generative paradigms, and conduct extensive experiments on five widely-used multi-modal benchmark datasets. Our experimental results reveal that TRIPS delivers a 40% speedup, while maintaining competitive or superior performance on downstream tasks.
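The select-and-fuse idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of a single text vector, the scaled dot-product scoring, the `keep_ratio` parameter, and the score-weighted averaging of the dropped patches are all assumptions chosen for illustration. The sketch uses only existing activations and adds no learned parameters, which matches the abstract's claim in spirit.

```python
import math


def text_patch_attention(text_vec, patches):
    """Softmax over scaled dot products between one text vector and each patch.

    Assumption: a single pooled text embedding stands in for the text guidance.
    """
    scale = math.sqrt(len(text_vec))
    logits = [sum(t * p for t, p in zip(text_vec, patch)) / scale
              for patch in patches]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_and_fuse(patches, scores, keep_ratio):
    """Keep the top-scoring patches and merge the rest into one fused token.

    keep_ratio is a hypothetical hyperparameter (fraction of patches kept).
    """
    n_keep = max(1, math.ceil(len(patches) * keep_ratio))
    ranked = sorted(range(len(patches)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:n_keep])  # restore the original patch order
    drop = ranked[n_keep:]
    kept = [patches[i] for i in keep]
    if not drop:
        return kept
    weight = sum(scores[i] for i in drop)
    dim = len(patches[0])
    if weight == 0.0:
        # Fall back to a plain average if every dropped score underflowed.
        fused = [sum(patches[i][d] for i in drop) / len(drop)
                 for d in range(dim)]
    else:
        # Score-weighted average of the inattentive patches.
        fused = [sum(scores[i] * patches[i][d] for i in drop) / weight
                 for d in range(dim)]
    return kept + [fused]


# Toy example: four 2-d "patches" and a text vector aligned with the first axis.
patches = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.0, 0.8]]
text_vec = [1.0, 0.0]
scores = text_patch_attention(text_vec, patches)
reduced = select_and_fuse(patches, scores, keep_ratio=0.5)
print(len(patches), "->", len(reduced))  # 4 patches become 2 kept + 1 fused
```

In this toy setting the two patches most aligned with the text vector are kept and the other two are merged into one token. Applying such a step at several layers of the visual backbone would shorten the sequence further each time, which is one way to read "progressively reduces" in the abstract.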

