LG · AI · Mar 26

Explaining, Verifying, and Aligning Semantic Hierarchies in Vision-Language Model Embeddings

arXiv:2603.26798 · 59.2 · h-index: 21
AI Analysis

For practitioners using VLMs, this work provides tools to inspect and improve the semantic organization of embedding spaces, but the approach is post-hoc and incremental.

The paper proposes a framework to explain, verify, and align semantic hierarchies in VLM embeddings, revealing that image encoders are more discriminative while text encoders better match human taxonomies, and showing a trade-off between zero-shot accuracy and ontological plausibility.

Vision-language model (VLM) encoders such as CLIP enable strong retrieval and zero-shot classification in a shared image-text embedding space, yet the semantic organization of this space is rarely inspected. We present a post-hoc framework to explain, verify, and align the semantic hierarchies induced by a VLM over a given set of child classes. First, we extract a binary hierarchy by agglomerative clustering of class centroids and name internal nodes by dictionary-based matching to a concept bank. Second, we quantify plausibility by comparing the extracted tree against human ontologies using efficient tree- and edge-level consistency measures, and we evaluate utility via explainable hierarchical tree-traversal inference with uncertainty-aware early stopping (UAES). Third, we propose an ontology-guided post-hoc alignment method that learns a lightweight embedding-space transformation, using UMAP to generate target neighborhoods from a desired hierarchy. Across 13 pretrained VLMs and 4 image datasets, our method finds systematic modality differences: image encoders are more discriminative, while text encoders induce hierarchies that better match human taxonomies. Overall, the results reveal a persistent trade-off between zero-shot accuracy and ontological plausibility and suggest practical routes to improve semantic alignment in shared embedding spaces.
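The first two steps of the abstract (hierarchy extraction and tree-traversal inference with early stopping) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the linkage criterion, the distance metric, or the exact uncertainty rule, so average linkage, cosine distance, subtree-mean prototypes, and a softmax confidence threshold are all assumptions made here for illustration. The function names and the `temperature` and `min_confidence` parameters are hypothetical. Node naming against a concept bank and the UMAP-based alignment step are left out.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree


def class_centroids(embeddings, labels, num_classes):
    """Unit-norm mean embedding per class (embeddings are normalised first)."""
    embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    centroids = np.stack(
        [embeddings[labels == c].mean(axis=0) for c in range(num_classes)]
    )
    return centroids / np.linalg.norm(centroids, axis=1, keepdims=True)


def extract_binary_hierarchy(centroids):
    """Agglomerative clustering of class centroids into a binary tree.

    Assumption: average linkage on cosine distance; the source does not
    say which linkage or metric is used.
    """
    merges = linkage(centroids, method="average", metric="cosine")
    return to_tree(merges)  # root ClusterNode; leaf ids are class indices


def leaves(node):
    """Class indices under a tree node."""
    return node.pre_order(lambda n: n.id)


def node_prototype(node, centroids):
    """Assumed node representation: unit-norm mean of its leaf centroids."""
    proto = centroids[leaves(node)].mean(axis=0)
    return proto / np.linalg.norm(proto)


def traverse(query, root, centroids, temperature=0.05, min_confidence=0.6):
    """Descend the tree, stopping early when the two-way choice is uncertain.

    Hypothetical stand-in for uncertainty-aware early stopping: stop at the
    current node if the softmax probability of the better child falls below
    `min_confidence`. Returns the final node and the ids visited on the way.
    """
    query = query / np.linalg.norm(query)
    node, path = root, [root.id]
    while not node.is_leaf():
        children = (node.get_left(), node.get_right())
        sims = np.array([query @ node_prototype(c, centroids) for c in children])
        logits = sims / temperature
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        if probs.max() < min_confidence:
            break  # too uncertain: answer with the coarser internal node
        node = children[int(probs.argmax())]
        path.append(node.id)
    return node, path
```

A query that stops at an internal node yields a coarser but still interpretable prediction: the set of classes returned by `leaves(node)`, together with the path of decisions that led there.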
