CVMar 3, 2020

Deep Multi-Modal Sets

arXiv:2003.01607v14 citations
AI Analysis

This addresses the problem of scalable and robust multi-modal learning for vision-related tasks, offering an incremental improvement over existing fusion methods.

The paper tackles the limitations of late fusion in multi-modal learning by proposing Deep Multi-Modal Sets, which represents features as an unordered set to handle varying modalities and cardinalities, achieving new state-of-the-art performance on Ads-Parallelity and MM-IMDb datasets.

Many vision-related tasks benefit from reasoning over multiple modalities to leverage complementary views of data in an attempt to learn robust embedding spaces. Most deep learning-based methods rely on a late fusion technique whereby multiple feature types are encoded and concatenated and then a multi layer perceptron (MLP) combines the fused embedding to make predictions. This has several limitations, such as an unnatural enforcement that all features be present at all times as well as constraining only a constant number of occurrences of a feature modality at any given time. Furthermore, as more modalities are added, the concatenated embedding grows. To mitigate this, we propose Deep Multi-Modal Sets: a technique that represents a collection of features as an unordered set rather than one long ever-growing fixed-size vector. The set is constructed so that we have invariance both to permutations of the feature modalities as well as to the cardinality of the set. We will also show that with particular choices in our model architecture, we can yield interpretable feature performance such that during inference time we can observe which modalities are most contributing to the prediction.With this in mind, we demonstrate a scalable, multi-modal framework that reasons over different modalities to learn various types of tasks. We demonstrate new state-of-the-art performance on two multi-modal datasets (Ads-Parallelity [34] and MM-IMDb [1]).

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes