CVMMMar 1, 2023

Selectively Hard Negative Mining for Alleviating Gradient Vanishing in Image-Text Matching

arXiv:2303.00181v111 citationsh-index: 26
Originality Incremental advance
AI Analysis

This addresses a training stability issue for researchers and practitioners in multimodal AI, though it is incremental as it builds on existing methods.

The paper tackles gradient vanishing in image-text matching models by proposing a selectively hard negative mining strategy, which improves training stability and achieves state-of-the-art performance on benchmarks.

Recently, a series of Image-Text Matching (ITM) methods achieve impressive performance. However, we observe that most existing ITM models suffer from gradients vanishing at the beginning of training, which makes these models prone to falling into local minima. Most ITM models adopt triplet loss with Hard Negative mining (HN) as the optimization objective. We find that optimizing an ITM model using only the hard negative samples can easily lead to gradient vanishing. In this paper, we derive the condition under which the gradient vanishes during training. When the difference between the positive pair similarity and the negative pair similarity is close to 0, the gradients on both the image and text encoders will approach 0. To alleviate the gradient vanishing problem, we propose a Selectively Hard Negative Mining (SelHN) strategy, which chooses whether to mine hard negative samples according to the gradient vanishing condition. SelHN can be plug-and-play applied to existing ITM models to give them better training behavior. To further ensure the back-propagation of gradients, we construct a Residual Visual Semantic Embedding model with SelHN, denoted as RVSE++. Extensive experiments on two ITM benchmarks demonstrate the strength of RVSE++, achieving state-of-the-art performance.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes