CV | Mar 19, 2024

UniBind: LLM-Augmented Unified and Balanced Representation Space to Bind Them All

arXiv:2403.12532v1 | 58 citations | CVPR
Originality: Incremental advance
AI Analysis

UniBind addresses the need for more balanced and effective multi-modal representation learning in AI systems that handle diverse data types, though it builds incrementally on existing CLIP-style approaches.

The paper tackles the problem of unbalanced representation spaces in multi-modal learning by introducing UniBind, which learns a unified and balanced representation space for seven diverse modalities using LLM-augmented embeddings. The method achieves an average 6.36% gain in zero-shot recognition over prior work and, in the multi-modal fine-tuning setting, a 6.75% gain on ImageNet with 90% fewer learnable parameters.

We present UniBind, a flexible and efficient approach that learns a unified representation space for seven diverse modalities: images, text, audio, point cloud, thermal, video, and event data. Existing works, e.g., ImageBind, treat the image as the central modality and build an image-centered representation space; however, such a space may be sub-optimal because it is unbalanced across modalities. Moreover, the category names are directly used to extract text embeddings for the downstream tasks, which makes it hard to represent the semantics of multi-modal data. The 'out-of-the-box' insight of UniBind is to make the alignment center modality-agnostic and further learn a unified and balanced representation space, empowered by large language models (LLMs). UniBind can be flexibly applied to all CLIP-style models and delivers remarkable performance boosts. To make this possible, we 1) construct a knowledge base of text embeddings with the help of LLMs and multi-modal LLMs; 2) adaptively build LLM-augmented class-wise embedding centers on top of the knowledge base and encoded visual embeddings; 3) align all the embeddings to the LLM-augmented embedding centers via contrastive learning to achieve a unified and balanced representation space. UniBind shows strong zero-shot recognition performance gains over prior art by an average of 6.36%. Finally, we achieve new state-of-the-art performance, e.g., a 6.75% gain on ImageNet, in the multi-modal fine-tuning setting while reducing the learnable parameters by 90%.
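
The three steps in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the plain averaging used to form each center, the temperature value, and the two-dimensional vectors are all assumptions made for illustration. In particular, the abstract says the centers are built adaptively from the knowledge base and encoded visual embeddings, whereas this sketch simply averages a class's text embeddings.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def class_centers(text_embeddings_by_class):
    """Build one center per class from its description embeddings.

    Assumption: a plain mean stands in for the adaptive, LLM-augmented
    center construction described in the abstract.
    """
    centers = {}
    for label, vectors in text_embeddings_by_class.items():
        dim = len(vectors[0])
        mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
        centers[label] = l2_normalize(mean)
    return centers


def center_contrastive_loss(embedding, label, centers, temperature=0.07):
    """InfoNCE-style loss against class centers (hypothetical form).

    Pulls the embedding toward its own class center and pushes it away
    from the other centers. The embedding may come from any modality.
    """
    z = l2_normalize(embedding)
    logits = {
        name: sum(a * b for a, b in zip(z, center)) / temperature
        for name, center in centers.items()
    }
    # Log-sum-exp with the maximum subtracted for numerical stability.
    max_logit = max(logits.values())
    log_denominator = max_logit + math.log(
        sum(math.exp(value - max_logit) for value in logits.values())
    )
    return log_denominator - logits[label]


def zero_shot_predict(embedding, centers):
    """Return the class whose center is most similar to the embedding."""
    z = l2_normalize(embedding)
    return max(centers, key=lambda name: sum(a * b for a, b in zip(z, centers[name])))


# Toy knowledge base: two classes, two description embeddings each.
knowledge_base = {
    "cat": [[1.0, 0.0], [0.9, 0.1]],
    "dog": [[0.0, 1.0], [0.1, 0.9]],
}
centers = class_centers(knowledge_base)
sample = [0.8, 0.2]  # stands in for an encoded audio, image, or event sample
prediction = zero_shot_predict(sample, centers)
loss_correct = center_contrastive_loss(sample, "cat", centers)
loss_wrong = center_contrastive_loss(sample, "dog", centers)
```

Because every modality is compared against the same text-derived centers rather than against image embeddings, no single modality acts as the anchor in this sketch, which is the balance property the abstract emphasizes.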
