CV LG MM

Jun 22, 2023

Learning Unseen Modality Interaction

Yunhua Zhang, Hazel Doughty, Cees G. M. Snoek

arXiv:2306.12795v39.813 citations

h-index: 67

Originality: Incremental advance

AI Analysis

The paper addresses a limitation of multimodal learning in applications where not all modality combinations are available during training. The contribution appears incremental because it builds on existing multimodal frameworks.

The paper tackles the problem of multimodal learning when some modality combinations are missing during training, aiming to generalize to unseen combinations during inference. It introduces a method that projects features into a common space and uses pseudo-supervision, showing effectiveness in tasks like video classification, robot state regression, and multimedia retrieval.

Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences. In this paper, we challenge this modality-complete assumption for multimodal learning and instead strive for generalization to unseen modality combinations during inference. We pose the problem of unseen modality interaction and introduce a first solution. It exploits a module that projects the multidimensional features of different modalities into a common space with rich information preserved. This allows the information to be accumulated with a simple summation operation across available modalities. To reduce overfitting to less discriminative modality combinations during training, we further improve the model learning with pseudo-supervision indicating the reliability of a modality's prediction. We demonstrate that our approach is effective for diverse tasks and modalities by evaluating it for multimodal video classification, robot state regression, and multimedia retrieval. Project website: https://xiaobai1217.github.io/Unseen-Modality-Interaction/.
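
The summation step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the modality names, the feature sizes, `COMMON_DIM`, and the random linear maps standing in for the learned projection module are all assumptions made for illustration. It only shows why summing projected features lets a model accept a modality combination that never appeared together during training.

```python
import random

COMMON_DIM = 4  # hypothetical size of the shared feature space


def make_projection(in_dim, out_dim, seed):
    """Random linear map standing in for a learned projection module (assumption)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]


def project(matrix, features):
    """Apply a linear projection: (out_dim x in_dim) matrix times a feature vector."""
    return [sum(w * x for w, x in zip(row, features)) for row in matrix]


def fuse(projections, inputs):
    """Sum the projected features of whichever modalities are present."""
    fused = [0.0] * COMMON_DIM
    for name, features in inputs.items():
        projected = project(projections[name], features)
        fused = [a + b for a, b in zip(fused, projected)]
    return fused


# One projection per modality; each modality has its own input size.
projections = {
    "video": make_projection(6, COMMON_DIM, seed=0),
    "audio": make_projection(3, COMMON_DIM, seed=1),
    "depth": make_projection(5, COMMON_DIM, seed=2),
}

video = [0.1] * 6
audio = [0.2] * 3
depth = [0.3] * 5

# A combination assumed to be seen in training, and one assumed to be unseen.
seen = fuse(projections, {"video": video, "audio": audio})
unseen = fuse(projections, {"audio": audio, "depth": depth})
```

Because every modality lands in the same space and the fusion is a plain sum, the fused vector has the same shape for any subset of modalities, so a downstream predictor needs no change when the combination changes. The pseudo-supervision component mentioned in the abstract is not covered by this sketch.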
