Global and Local Entailment Learning for Natural World Imagery
This work addresses the problem of representing hierarchical biological data (e.g., the Tree of Life) for researchers in computational biology and ecology, though it appears incremental because it builds on prior entailment learning approaches.
The paper tackles the challenge of learning hierarchical structure in vision-language models by introducing Radial Cross-Modal Embeddings (RCME), a framework that explicitly models transitivity-enforced entailment to optimize the partial order of concepts. The result is a hierarchical vision-language foundation model that outperforms state-of-the-art models on hierarchical species classification and retrieval tasks.
Learning the hierarchical structure of data in vision-language models is a significant challenge. Previous works have attempted to address this challenge by employing entailment learning. However, these approaches fail to model the transitive nature of entailment explicitly, which establishes the relationship between order and semantics within a representation space. In this work, we introduce Radial Cross-Modal Embeddings (RCME), a framework that enables the explicit modeling of transitivity-enforced entailment. Our proposed framework optimizes for the partial order of concepts within vision-language models. By leveraging our framework, we develop a hierarchical vision-language foundation model capable of representing the hierarchy in the Tree of Life. Our experiments on hierarchical species classification and hierarchical retrieval tasks demonstrate the enhanced performance of our models compared to the existing state-of-the-art models. Our code and models are open-sourced at https://vishu26.github.io/RCME/index.html.
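To make the notion of transitive entailment over a partial order concrete, the following is a minimal sketch on a toy taxonomy. It is not the RCME method and does not involve embeddings. It only illustrates the property the abstract refers to: if a species entails its genus and the genus entails its family, then the species entails the family. The taxonomy subset and the helper names (`PARENT`, `ancestors`, `entails`, `is_transitive`) are assumptions introduced for illustration.

```python
# Toy child -> parent map over a small, hand-picked slice of the Tree of Life.
# This is an illustrative assumption, not data from the paper.
PARENT = {
    "Panthera leo": "Panthera",
    "Panthera": "Felidae",
    "Felidae": "Carnivora",
    "Carnivora": "Mammalia",
    "Canis lupus": "Canis",
    "Canis": "Canidae",
    "Canidae": "Carnivora",
}


def ancestors(concept):
    """Return all strict ancestors of `concept`, nearest first."""
    chain = []
    while concept in PARENT:
        concept = PARENT[concept]
        chain.append(concept)
    return chain


def entails(specific, general):
    """True if `specific` equals `general` or is one of its descendants."""
    return specific == general or general in ancestors(specific)


def is_transitive(concepts):
    """Check that a entails b and b entails c always implies a entails c."""
    return all(
        entails(a, c)
        for a in concepts
        for b in concepts
        for c in concepts
        if entails(a, b) and entails(b, c)
    )


# Every concept that appears as a child or a parent in the toy taxonomy.
CONCEPTS = sorted(set(PARENT) | set(PARENT.values()))

if __name__ == "__main__":
    print(entails("Panthera leo", "Carnivora"))
    print(entails("Canis lupus", "Felidae"))
    print(is_transitive(CONCEPTS))
```

In a symbolic tree, transitivity holds by construction. In a learned representation space it does not come for free, which is the gap the abstract says prior entailment learning approaches leave open.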