IV · CV · Mar 8, 2021

Multimodal Representation Learning via Maximization of Local Mutual Information

arXiv:2103.04537v5 · 47 citations
AI Analysis

This work addresses representation learning for multimodal data, specifically images and text, but appears incremental as it builds on existing mutual information estimation methods.

The paper tackles the problem of learning useful image representations by maximizing local mutual information between image and text features, resulting in improved performance on downstream image classification tasks.

We propose and demonstrate a representation learning approach by maximizing the mutual information between local features of images and text. The goal of this approach is to learn useful image representations by taking advantage of the rich information contained in the free text that describes the findings in the image. Our method trains image and text encoders by encouraging the resulting representations to exhibit high local mutual information. We make use of recent advances in mutual information estimation with neural network discriminators. We argue that the sum of local mutual information is typically a lower bound on the global mutual information. Our experimental results in the downstream image classification tasks demonstrate the advantages of using local features for image-text representation learning.
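As a rough illustration of the idea in the abstract, the sketch below scores each local image region against text features and sums a per-region lower bound on mutual information. It is a minimal sketch under stated assumptions, not the authors' implementation: the InfoNCE estimator is swapped in as one well-known neural lower bound, a plain dot product stands in for the learned discriminator, and the names `infonce_bound` and `local_mi_objective` are hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors (stand-in for a learned critic)."""
    return sum(a * b for a, b in zip(u, v))


def logsumexp(xs):
    """Numerically stable log(sum(exp(x))) for a list of floats."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def infonce_bound(scores):
    """InfoNCE lower bound on mutual information from an N x N score matrix.

    scores[i][j] is the critic score of image-side item i against text j;
    diagonal entries are the true (paired) samples. The bound is at most log N.
    """
    n = len(scores)
    total = sum(scores[i][i] - logsumexp(scores[i]) for i in range(n))
    return total / n + math.log(n)


def local_mi_objective(image_regions, text_feats):
    """Sum of per-region InfoNCE bounds (assumed form of a local objective).

    image_regions[i] is a list of region vectors for image i;
    text_feats[i] is the feature vector of the text paired with image i.
    """
    n_regions = len(image_regions[0])
    total = 0.0
    for r in range(n_regions):
        # Score region r of every image against every text in the batch.
        scores = [[dot(img[r], txt) for txt in text_feats]
                  for img in image_regions]
        total += infonce_bound(scores)
    return total


# Toy batch: two images with two regions each, and their paired texts.
image_regions = [[[4.0, 0.0], [4.0, 0.0]], [[0.0, 4.0], [0.0, 4.0]]]
text_feats = [[1.0, 0.0], [0.0, 1.0]]
aligned = local_mi_objective(image_regions, text_feats)
shuffled = local_mi_objective(image_regions, text_feats[::-1])
```

In training, the encoders would be updated to increase this quantity; here the aligned pairing scores higher than the shuffled one, which is the behaviour the objective rewards.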

Code Implementations: 1 repo