LG · AI · Jun 8, 2021

What Makes Multi-modal Learning Better than Single (Provably)

arXiv:2106.04538v2 · 373 citations
Originality: Incremental advance
AI Analysis

The paper provides a foundational theoretical guarantee for multi-modal learning, a field in which formal justification has been notably lacking.

The paper addresses the lack of theoretical justification for why multi-modal learning outperforms single-modal learning. It proves that, under a common fusion framework, learning with multiple modalities achieves a smaller population risk than learning with only a subset of them, because the latent space representation is estimated more accurately.

The world provides us with data of multiple modalities. Intuitively, models fusing data from different modalities outperform their uni-modal counterparts, since more information is aggregated. Recently, joining the success of deep learning, there has been an influential line of work on deep multi-modal learning, which has achieved remarkable empirical results on various applications. However, theoretical justifications in this field are notably lacking. Can multi-modal learning provably perform better than uni-modal? In this paper, we answer this question under a most popular multi-modal fusion framework, which first encodes features from different modalities into a common latent space and then seamlessly maps the latent representations into the task space. We prove that learning with multiple modalities achieves a smaller population risk than using only a subset of the modalities. The main intuition is that the former has a more accurate estimate of the latent space representation. To the best of our knowledge, this is the first theoretical treatment to capture important qualitative phenomena observed in real multi-modal applications from the generalization perspective. Combined with experimental results, we show that multi-modal learning does possess an appealing formal guarantee.
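The claim in the abstract can be illustrated with a toy simulation. The sketch below is not the paper's experiment or proof. It assumes a two-dimensional latent variable, two noisy "modalities" that each reveal one latent coordinate, and a linear least-squares model standing in for the encoder and task map. All names (`simulate`, `fit_and_risk`, `x_a`, `x_b`) and all noise levels are hypothetical choices made for illustration. The held-out mean squared error serves as an estimate of the population risk.

```python
import numpy as np


def fit_and_risk(x_train, y_train, x_test, y_test):
    """Fit least squares on the training split and return held-out MSE.

    The linear map plays the role of encoder and task map combined;
    this is a simplification, not the paper's model class.
    """
    w, *_ = np.linalg.lstsq(x_train, y_train, rcond=None)
    return float(np.mean((x_test @ w - y_test) ** 2))


def simulate(seed=0, n_train=2000, n_test=20000):
    """Compare estimated risk with one modality vs. both (toy setup)."""
    rng = np.random.default_rng(seed)
    n = n_train + n_test

    # Shared latent representation (assumed 2-D for illustration).
    z = rng.normal(size=(n, 2))
    # The target depends on both latent coordinates, plus label noise.
    y = z[:, 0] + z[:, 1] + 0.1 * rng.normal(size=n)

    # Each modality is a noisy view of one latent coordinate.
    x_a = z[:, [0]] + 0.1 * rng.normal(size=(n, 1))
    x_b = z[:, [1]] + 0.1 * rng.normal(size=(n, 1))
    x_ab = np.hstack([x_a, x_b])

    train, test = slice(0, n_train), slice(n_train, n)
    risk_a = fit_and_risk(x_a[train], y[train], x_a[test], y[test])
    risk_ab = fit_and_risk(x_ab[train], y[train], x_ab[test], y[test])
    return risk_a, risk_ab


if __name__ == "__main__":
    risk_a, risk_ab = simulate()
    print(f"modality A only: estimated risk {risk_a:.3f}")
    print(f"modalities A+B:  estimated risk {risk_ab:.3f}")
```

In this setup the single-modality model cannot recover the second latent coordinate, so its estimated risk stays close to the variance of that coordinate, while the two-modality model recovers both. This mirrors the stated intuition that more modalities give a more accurate estimate of the latent representation, but only for this hand-built example.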

