CV | AI | LG | MM | May 9, 2023

ImageBind: One Embedding Space To Bind Them All

arXiv:2305.05665v2 | 1573 citations
Originality: Highly original
AI Analysis

ImageBind addresses the problem of multimodal integration by giving AI researchers and practitioners a single unified embedding space, though it builds incrementally on existing vision-language models.

ImageBind learns a joint embedding across six modalities using only image-paired data, enabling zero-shot capabilities and setting a new state of the art on emergent zero-shot recognition tasks, where it outperforms specialist supervised models.

We present ImageBind, an approach to learn a joint embedding across six different modalities - images, text, audio, depth, thermal, and IMU data. We show that all combinations of paired data are not necessary to train such a joint embedding, and only image-paired data is sufficient to bind the modalities together. ImageBind can leverage recent large scale vision-language models, and extends their zero-shot capabilities to new modalities just by using their natural pairing with images. It enables novel emergent applications 'out-of-the-box' including cross-modal retrieval, composing modalities with arithmetic, cross-modal detection and generation. The emergent capabilities improve with the strength of the image encoder and we set a new state-of-the-art on emergent zero-shot recognition tasks across modalities, outperforming specialist supervised models. Finally, we show strong few-shot recognition results outperforming prior work, and that ImageBind serves as a new way to evaluate vision models for visual and non-visual tasks.
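As a rough illustration of what a shared embedding space makes possible, the sketch below shows cross-modal retrieval and "composing modalities with arithmetic" using plain cosine similarity. It is a minimal toy, not ImageBind's code or API: the function names (`l2_normalize`, `retrieve`, `compose`) are hypothetical, and the 4-dimensional vectors are hand-made stand-ins for what trained image and audio encoders would produce.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    x = np.asarray(x, dtype=float)
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)


def retrieve(query, gallery, top_k=1):
    """Return indices of the top_k gallery rows closest to query by cosine similarity."""
    q = l2_normalize(query)
    g = l2_normalize(gallery)
    scores = g @ q  # cosine similarity, since both sides are unit length
    return np.argsort(-scores)[:top_k]


def compose(a, b):
    """Combine two embeddings by adding their unit vectors and re-normalizing."""
    return l2_normalize(l2_normalize(a) + l2_normalize(b))


# Hand-made toy embeddings (assumption: real ones would come from trained encoders).
# Axis 0 loosely means "beach", axis 1 "dog", axis 2 "city".
image_gallery = np.array([
    [1.0, 0.0, 0.0, 0.0],  # 0: beach
    [0.0, 1.0, 0.0, 0.0],  # 1: dog
    [1.0, 1.0, 0.0, 0.0],  # 2: dog on a beach
    [0.0, 0.0, 1.0, 0.0],  # 3: city street
])
image_beach = image_gallery[0]
audio_bark = np.array([0.1, 0.9, 0.0, 0.0])  # toy "barking" audio embedding

# Cross-modal retrieval: an audio query ranks images directly.
print(retrieve(audio_bark, image_gallery, top_k=2))

# Embedding arithmetic: beach image + barking audio should favor "dog on a beach".
print(retrieve(compose(image_beach, audio_bark), image_gallery))
```

Because every modality lands in the same space, the retrieval function never needs to know which modality produced the query. In this toy setup, the barking query ranks the dog image first, and the composed query ranks the dog-on-a-beach image first.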

Code Implementations: 3 repos