CV | LG | Nov 7, 2023

OmniVec: Learning robust representations with cross modal sharing

arXiv:2311.05709v1 | 90 citations | h-index: 3
Originality: Incremental advance
AI Analysis

The paper addresses the need to design a separate network for each task and modality by offering a unified approach that could benefit multimodal AI applications. It appears incremental, however, because it builds on existing self-supervised and sequential training methods.

The paper tackles the problem of learning across multiple modalities with a unified architecture, achieving state-of-the-art results on 22 diverse benchmarks through joint training that enables meaningful information sharing.

The majority of research in learning-based methods has gone towards designing and training networks for specific tasks. However, many learning-based tasks, across modalities, share commonalities and could potentially be tackled in a joint framework. We present an approach in this direction, to learn multiple tasks, in multiple modalities, with a unified architecture. The proposed network is composed of task-specific encoders, a common trunk in the middle, and task-specific prediction heads. We first pre-train it by self-supervised masked training, followed by sequential training for the different tasks. We train the network on all major modalities, e.g. visual, audio, text and 3D, and report results on 22 diverse and challenging public benchmarks. We demonstrate empirically that using a joint network to train across modalities leads to meaningful information sharing, and that this allows us to achieve state-of-the-art results on most of the benchmarks. We also show generalization of the trained network on cross-modal tasks as well as unseen datasets and tasks.
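The layout described in the abstract (task-specific encoders, a common trunk, task-specific prediction heads) can be illustrated with a toy sketch. This is not the authors' implementation: the class name `SharedTrunkModel`, the helper `mask_input`, the single-layer random projections, the dimensions, and the zero-masking scheme are all assumptions chosen only to show how one trunk can be shared across modalities and how an input might be masked for self-supervised pre-training.

```python
import random


def linear(in_dim, out_dim, rng):
    """Random weight matrix with out_dim rows and in_dim columns (toy layer)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def apply(weights, x):
    """Matrix-vector product followed by a ReLU."""
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]


class SharedTrunkModel:
    """Hypothetical toy model: per-modality encoders, one trunk, per-task heads."""

    def __init__(self, input_dims, output_dims, trunk_dim=8, seed=0):
        rng = random.Random(seed)
        # One encoder per modality, each projecting into the shared trunk width.
        self.encoders = {m: linear(d, trunk_dim, rng) for m, d in input_dims.items()}
        # A single trunk whose weights are reused for every modality and task.
        self.trunk = linear(trunk_dim, trunk_dim, rng)
        # One prediction head per task.
        self.heads = {t: linear(trunk_dim, d, rng) for t, d in output_dims.items()}

    def forward(self, modality, task, x):
        h = apply(self.encoders[modality], x)  # modality-specific encoding
        h = apply(self.trunk, h)               # shared representation
        return apply(self.heads[task], h)      # task-specific prediction


def mask_input(x, ratio, rng):
    """Zero out a random fraction of entries (assumed masking scheme).

    Returns the masked copy and the set of masked positions, which a
    reconstruction objective could use as targets.
    """
    n_mask = int(len(x) * ratio)
    idx = set(rng.sample(range(len(x)), n_mask))
    return [0.0 if i in idx else v for i, v in enumerate(x)], idx


model = SharedTrunkModel(
    input_dims={"image": 16, "audio": 12},
    output_dims={"image_cls": 5, "audio_cls": 3},
)
image_scores = model.forward("image", "image_cls", [0.5] * 16)
audio_scores = model.forward("audio", "audio_cls", [0.5] * 12)
masked, positions = mask_input([1.0] * 10, 0.3, random.Random(0))
```

Because the encoder and head are selected independently, the same trunk weights serve every modality and task pair. In this sketch that shared trunk is the only place where information could be shared across modalities; the training procedure itself (masked pre-training followed by sequential per-task training) is not implemented here.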

