LG · Mar 23, 2022

Modality Competition: What Makes Joint Training of Multi-modal Network Fail in Deep Learning? (Provably)

arXiv:2203.12221v1 · 178 citations · h-index: 45
Originality: Highly original
AI Analysis

This paper addresses a foundational problem in multi-modal deep learning by explaining why joint training fails. The contribution is incremental in the sense that it builds on prior empirical observations.

The paper tackles the counter-intuitive performance gap where uni-modal networks outperform jointly trained multi-modal networks, proving theoretically that modality competition causes encoders to learn only a subset of modalities, leading to sub-optimality.

Despite the remarkable success of deep multi-modal learning in practice, it has not been well explained in theory. Recently, it has been observed that the best uni-modal network outperforms the jointly trained multi-modal network, which is counter-intuitive since multiple signals generally bring more information. This work provides a theoretical explanation for the emergence of this performance gap in neural networks under the prevalent joint training framework. Based on a simplified data distribution that captures realistic properties of multi-modal data, we prove that for a multi-modal late-fusion network with (smoothed) ReLU activation trained jointly by gradient descent, different modalities compete with each other, and the encoder networks learn only a subset of the modalities. We refer to this phenomenon as modality competition. The losing modalities, which fail to be discovered, are the source of the sub-optimality of joint training. Experimentally, we illustrate that modality competition matches the intrinsic behavior of late-fusion joint training.
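To make the setting in the abstract concrete, here is a minimal NumPy sketch of a two-modality late-fusion forward pass with a smoothed ReLU. It is an illustration under stated assumptions, not the paper's exact model. The polynomial form of the activation (with parameters `q` and `beta`), the concatenate-then-linear-head fusion, and all names and shapes below are hypothetical choices made for this example.

```python
import numpy as np


def smoothed_relu(z, q=3, beta=1.0):
    """Smoothed ReLU (assumed form): 0 for z <= 0, a degree-q polynomial on
    [0, beta], and a shifted linear function for z >= beta."""
    z = np.asarray(z, dtype=float)
    # Polynomial branch; clipping makes it evaluate to 0 for negative inputs.
    poly = np.clip(z, 0.0, beta) ** q / (q * beta ** (q - 1))
    # Linear branch, shifted so both branches agree at z = beta.
    linear = z - (1.0 - 1.0 / q) * beta
    return np.where(z >= beta, linear, poly)


def late_fusion_forward(x1, x2, W1, W2, head):
    """Encode each modality separately, then fuse the features at the end.

    x1: (n, d1), x2: (n, d2)   inputs for modality 1 and modality 2
    W1: (m, d1), W2: (m, d2)   one single-layer encoder per modality
    head: (2 * m,)             linear head on the concatenated features
    """
    h1 = smoothed_relu(x1 @ W1.T)              # (n, m) features, modality 1
    h2 = smoothed_relu(x2 @ W2.T)              # (n, m) features, modality 2
    fused = np.concatenate([h1, h2], axis=1)   # late fusion: (n, 2 * m)
    return fused @ head                        # (n,) scalar outputs


# Usage with random, purely illustrative data.
rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=(4, 5)), rng.normal(size=(4, 7))
W1, W2 = rng.normal(size=(3, 5)), rng.normal(size=(3, 7))
head = np.ones(6)
logits = late_fusion_forward(x1, x2, W1, W2, head)
```

In joint training, `W1`, `W2`, and `head` would all be updated by gradient descent on a single loss computed from `logits`, which is the regime in which the abstract says one modality's encoder can end up learning little.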

