CV · Oct 7, 2020

Universal Weighting Metric Learning for Cross-Modal Matching

arXiv:2010.03403v1 · 103 citations · Has Code
Originality: Incremental advance
AI Analysis

This work addresses cross-modal matching for vision and language tasks. It is an incremental advance whose main contribution is a new method for weighting informative pairs.

The paper tackles cross-modal matching by proposing a universal weighting framework for sampling and weighting informative pairs. Within that framework it introduces a new polynomial loss that defines separate weight functions for positive and negative pairs, and it demonstrates the method's efficacy on image-text and video-text benchmarks.

Cross-modal matching has been a prominent research topic in both the vision and language areas. Learning an appropriate mining strategy to sample and weight informative pairs is crucial for cross-modal matching performance. However, most existing metric learning methods are developed for unimodal matching, which makes them unsuitable for cross-modal matching on multimodal data with heterogeneous features. To address this problem, we propose a simple and interpretable universal weighting framework for cross-modal matching, which provides a tool to analyze the interpretability of various loss functions. Furthermore, we introduce a new polynomial loss under the universal weighting framework, which defines separate weight functions for the positive and negative informative pairs. Experimental results on two image-text matching benchmarks and two video-text matching benchmarks validate the efficacy of the proposed method.
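To make the idea of separate polynomial weight functions concrete, here is a minimal sketch in plain Python. It is not the paper's formulation: the coefficient lists, the margin test used to select informative pairs, the hardest-negative mining, and the function names (`poly`, `pair_weight`, `polynomial_pair_loss`) are all assumptions chosen for illustration. The sketch treats the diagonal of a similarity matrix as positive pairs, picks the hardest negative in each row, and reads the magnitude of each polynomial's derivative as the weight that pair receives.

```python
def poly(coeffs, x):
    """Evaluate sum_k coeffs[k] * x**k using Horner's rule."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result


def poly_derivative(coeffs):
    """Coefficients of the derivative of the polynomial given by coeffs."""
    return [k * c for k, c in enumerate(coeffs)][1:]


def pair_weight(coeffs, s):
    """Hypothetical pair weight: |d loss / d similarity| evaluated at s."""
    return abs(poly(poly_derivative(coeffs), s))


def polynomial_pair_loss(sim, pos_coeffs, neg_coeffs, margin=0.2):
    """Illustrative loss over a square similarity matrix (one direction only).

    sim[i][i] is the positive pair for query i; sim[i][j], j != i, are
    negatives. Only rows whose hardest negative violates the margin
    contribute (an assumed definition of "informative").
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        s_pos = sim[i][i]
        s_neg = max(sim[i][j] for j in range(n) if j != i)  # hardest negative
        if s_neg + margin > s_pos:
            total += poly(pos_coeffs, s_pos) + poly(neg_coeffs, s_neg)
    return total / n


# Assumed example coefficients: (1 - s)^2 for positives, s^2 for negatives.
POS_COEFFS = [1.0, -2.0, 1.0]
NEG_COEFFS = [0.0, 0.0, 1.0]

sim = [[0.9, 0.1],
       [0.8, 0.5]]
loss = polynomial_pair_loss(sim, POS_COEFFS, NEG_COEFFS)
```

With these assumed coefficients, a positive pair with low similarity and a negative pair with high similarity both receive larger weights, which is the qualitative behaviour the abstract describes. A full cross-modal version would presumably also apply the same computation in the reverse retrieval direction.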

Code Implementations: 2 repos
