Hyperbolic Image-Text Representations
This work addresses the need for explicit hierarchical modeling in multi-modal AI, though the contribution is incremental because it builds on existing contrastive methods.
Current vision-language models such as CLIP do not explicitly capture hierarchical relationships between images and text. The authors address this by proposing MERU, a contrastive model that embeds images and text in hyperbolic space, whose geometry is well suited to tree-like hierarchies. MERU learns interpretable, structured representations while remaining competitive with CLIP on tasks such as image classification and image-text retrieval.
Visual and linguistic concepts naturally organize themselves in a hierarchy, where a textual concept "dog" entails all images that contain dogs. Despite being intuitive, current large-scale vision and language models such as CLIP do not explicitly capture such hierarchy. We propose MERU, a contrastive model that yields hyperbolic representations of images and text. Hyperbolic spaces have suitable geometric properties to embed tree-like data, so MERU can better capture the underlying hierarchy in image-text datasets. Our results show that MERU learns a highly interpretable and structured representation space while being competitive with CLIP's performance on standard multi-modal tasks like image classification and image-text retrieval. Our code and models are available at https://www.github.com/facebookresearch/meru
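To make the idea of a contrastive objective over hyperbolic embeddings concrete, the following is a minimal sketch, not the released implementation. It assumes the Lorentz (hyperboloid) model with curvature -curv, uses negative geodesic distance as the similarity score, and fixes the temperature at 0.07. None of these choices is stated in the abstract above, and all function names (exp_map_origin, time_component, distance, contrastive_loss) are illustrative.

```python
import math


def exp_map_origin(v, curv=1.0):
    """Lift a Euclidean vector onto the hyperboloid (assumed Lorentz model).

    Returns only the space components; the time component is recovered
    from the hyperboloid constraint in time_component().
    """
    sqrt_c = math.sqrt(curv)
    norm = math.sqrt(sum(a * a for a in v))
    if norm < 1e-8:
        return list(v)  # sinh(t) / t tends to 1 as t tends to 0
    scale = math.sinh(sqrt_c * norm) / (sqrt_c * norm)
    return [scale * a for a in v]


def time_component(x, curv=1.0):
    """Time coordinate implied by <x, x>_L = -1 / curv."""
    return math.sqrt(1.0 / curv + sum(a * a for a in x))


def distance(x, y, curv=1.0):
    """Geodesic distance between two points given by their space components."""
    inner = sum(a * b for a, b in zip(x, y))
    inner -= time_component(x, curv) * time_component(y, curv)
    # Clamp to the domain of acosh to absorb floating-point error.
    return math.acosh(max(1.0, -curv * inner)) / math.sqrt(curv)


def contrastive_loss(images, texts, curv=1.0, temperature=0.07):
    """Symmetric cross-entropy loss; row k of images is paired with row k of texts."""
    n = len(images)
    logits = [[-distance(i, t, curv) / temperature for t in texts] for i in images]

    def cross_entropy(rows):
        total = 0.0
        for k, row in enumerate(rows):
            m = max(row)  # subtract the max for a stable log-sum-exp
            log_z = m + math.log(sum(math.exp(l - m) for l in row))
            total += log_z - row[k]
        return total / n

    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy(logits) + cross_entropy(transposed))
```

In this sketch, a point's distance from the origin equals the norm of the Euclidean vector it was lifted from, so generic concepts can sit near the origin and specific ones farther out. Correctly paired embeddings give a lower loss than mismatched ones.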