CV, AI, IR, LG | Jan 20, 2022

Omnivore: A Single Model for Many Visual Modalities

arXiv:2201.08377v2 | 308 citations
AI Analysis

Omnivore addresses the need for unified models in computer vision. It gives researchers and practitioners a single model that supports cross-modal recognition without modality-specific designs.

The paper tackles the problem of isolated architectures for different visual modalities by proposing Omnivore, a single transformer-based model that achieves competitive performance on image, video, and single-view 3D classification: 86.0% on ImageNet, 84.1% on Kinetics, and 67.1% on SUN RGB-D.

Prior work has studied different visual modalities in isolation and developed separate architectures for recognition of images, videos, and 3D data. Instead, in this paper, we propose a single model which excels at classifying images, videos, and single-view 3D data using exactly the same model parameters. Our 'Omnivore' model leverages the flexibility of transformer-based architectures and is trained jointly on classification tasks from different modalities. Omnivore is simple to train, uses off-the-shelf standard datasets, and performs at-par or better than modality-specific models of the same size. A single Omnivore model obtains 86.0% on ImageNet, 84.1% on Kinetics, and 67.1% on SUN RGB-D. After finetuning, our models outperform prior work on a variety of vision tasks and generalize across modalities. Omnivore's shared visual representation naturally enables cross-modal recognition without access to correspondences between modalities. We hope our results motivate researchers to model visual modalities together.
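The idea of running images, videos, and single-view 3D data through exactly the same parameters can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The patch size, the zero depth channel for RGB inputs, the zero padding along time, and the mean pooling that stands in for the transformer trunk are all assumptions made for illustration, and the names `to_common_tensor`, `patchify`, and `embed` are hypothetical.

```python
import numpy as np

PATCH = (2, 4, 4)   # (frames, height, width) per patch; illustrative values
EMBED_DIM = 16      # illustrative embedding width


def to_common_tensor(x, modality):
    """Bring any input to shape (T, H, W, 4): RGB plus one depth channel."""
    x = np.asarray(x, dtype=np.float32)
    if modality in ("image", "rgbd"):
        x = x[None]                       # treat a still as a one-frame video
    elif modality != "video":
        raise ValueError(f"unknown modality: {modality}")
    if x.shape[-1] == 3:                  # assumption: RGB gets a zero depth channel
        depth = np.zeros(x.shape[:-1] + (1,), dtype=x.dtype)
        x = np.concatenate([x, depth], axis=-1)
    return x


def patchify(x, patch=PATCH):
    """Split (T, H, W, C) into non-overlapping patches, one flat row per patch."""
    pt, ph, pw = patch
    t, h, w, c = x.shape
    pad_t = (-t) % pt                     # assumption: zero-pad along time
    if pad_t:
        x = np.concatenate([x, np.zeros((pad_t, h, w, c), dtype=x.dtype)], axis=0)
        t += pad_t
    if h % ph or w % pw:
        raise ValueError("height and width must be multiples of the patch size")
    x = x.reshape(t // pt, pt, h // ph, ph, w // pw, pw, c)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)  # patch-grid axes first, patch contents last
    return x.reshape(-1, pt * ph * pw * c)


def embed(x, modality, weight):
    """Project patches with one shared weight matrix, then pool to one vector."""
    tokens = patchify(to_common_tensor(x, modality)) @ weight
    return tokens.mean(axis=0)            # placeholder for a shared transformer trunk


rng = np.random.default_rng(0)
weight = rng.normal(size=(PATCH[0] * PATCH[1] * PATCH[2] * 4, EMBED_DIM)).astype(np.float32)

image = rng.random((8, 8, 3))             # (H, W, RGB)
video = rng.random((4, 8, 8, 3))          # (T, H, W, RGB)
rgbd = rng.random((8, 8, 4))              # (H, W, RGB + depth)

inputs = [("image", image), ("video", video), ("rgbd", rgbd)]
features = {name: embed(x, name, weight) for name, x in inputs}
```

Because every modality ends up as a vector in the same space, produced by the same `weight`, features from different modalities can be compared directly, for example with cosine similarity. This is the property that makes cross-modal recognition possible without paired data.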

Code Implementations: 2 repos
