Learning Visual Hierarchies in Hyperbolic Space for Image Retrieval
This work addresses the challenge of capturing semantic and structural information beyond visual similarity in image retrieval. The contribution is incremental: it builds on existing contrastive loss methods but applies them to a novel hierarchical context.
The paper tackles the problem of learning visual hierarchies for image retrieval by introducing a paradigm that encodes user-defined multi-level hierarchies in hyperbolic space without explicit hierarchical labels, resulting in significant improvements in hierarchical retrieval tasks.
Structuring latent representations in a hierarchical manner enables models to learn patterns at multiple levels of abstraction. However, most prevalent image understanding models focus on visual similarity, and learning visual hierarchies is relatively unexplored. In this work, for the first time, we introduce a learning paradigm that can encode user-defined multi-level complex visual hierarchies in hyperbolic space without requiring explicit hierarchical labels. As a concrete example, first, we define a part-based image hierarchy using object-level annotations within and across images. Then, we introduce an approach to enforce the hierarchy using contrastive loss with pairwise entailment metrics. Finally, we discuss new evaluation metrics to effectively measure hierarchical image retrieval. Encoding these complex relationships ensures that the learned representations capture semantic and structural information that transcends mere visual similarity. Experiments in part-based image retrieval show significant improvements in hierarchical retrieval tasks, demonstrating the capability of our model in capturing visual hierarchies.
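To make the idea of a contrastive objective in hyperbolic space concrete, the following is a minimal sketch and not the paper's implementation. It assumes the Poincaré ball model, which the abstract does not specify, and it uses plain geodesic distance as the pairwise score in place of the entailment metric the abstract mentions. The function names `poincare_distance` and `contrastive_loss` and the `temperature` parameter are hypothetical choices made for illustration.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit Poincare ball."""
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # d(u, v) = arcosh(1 + 2 * ||u - v||^2 / ((1 - ||u||^2) * (1 - ||v||^2)))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v))
    return math.acosh(arg)


def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss using negative hyperbolic distance as the similarity.

    Assumption: a plain distance score stands in for a pairwise entailment metric.
    """
    logits = [-poincare_distance(anchor, positive) / temperature]
    logits += [-poincare_distance(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over the positive and all negatives.
    max_logit = max(logits)
    log_sum = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    # Cross-entropy with the positive pair as the target class.
    return log_sum - logits[0]


if __name__ == "__main__":
    anchor = [0.1, 0.0]
    near = [0.15, 0.0]   # hypothetical embedding of a related image
    far = [-0.8, 0.3]    # hypothetical embedding of an unrelated image
    print(contrastive_loss(anchor, near, [far]))
```

In this sketch the loss shrinks as the positive pair moves closer together than the negatives. Distances grow rapidly near the boundary of the ball, which is the property that makes hyperbolic space a natural fit for tree-like hierarchies.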