Hierarchical Semantic Retrieval with Cobweb
This work addresses the need for more interpretable and robust retrieval systems in natural language processing. It appears incremental, since it layers a hierarchical approach on top of existing embedding methods.
The paper tackles the problem of underused corpus structure and opaque explanations in neural document retrieval. It applies Cobweb, a hierarchy-aware framework that organizes sentence embeddings into a prototype tree for coarse-to-fine traversal. The authors report competitive effectiveness, improved robustness to embedding quality, and interpretable retrieval: the approach matches dot product search on strong encoder embeddings and remains robust when kNN degrades, for example with GPT-2 vectors, where dot product performance collapses.
Neural document retrieval often treats a corpus as a flat cloud of vectors scored at a single granularity, leaving corpus structure underused and explanations opaque. We use Cobweb, a hierarchy-aware framework, to organize sentence embeddings into a prototype tree and rank documents via coarse-to-fine traversal. Internal nodes act as concept prototypes, providing multi-granular relevance signals and a transparent rationale through retrieval paths. We instantiate two inference approaches: a generalized best-first search and a lightweight path-sum ranker. We evaluate our approaches on MS MARCO and QQP with encoder (e.g., BERT/T5) and decoder (GPT-2) representations. Our results show that our retrieval approaches match dot product search on strong encoder embeddings while remaining robust when kNN degrades: with GPT-2 vectors, dot product performance collapses, whereas our approaches still retrieve relevant results. Overall, our experiments suggest that Cobweb provides competitive effectiveness, improved robustness to embedding quality, scalability, and interpretable retrieval via hierarchical prototypes.
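To make the two inference approaches concrete, the following is a minimal sketch of ranking over a prototype tree. It is not the authors' implementation: the abstract does not specify how Cobweb builds the tree or scores nodes, so this sketch assumes a hand-built toy tree, mean-vector prototypes, and dot-product scoring. The names Node, best_first, and path_sum are hypothetical.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count
from typing import List, Optional, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def mean(vectors: List[Vector]) -> Vector:
    return [sum(col) / len(vectors) for col in zip(*vectors)]


@dataclass
class Node:
    prototype: Vector                      # assumed: a mean embedding
    children: List["Node"] = field(default_factory=list)
    doc_id: Optional[str] = None           # set only on leaves


def leaf(doc_id: str, vec: Vector) -> Node:
    return Node(prototype=vec, doc_id=doc_id)


def internal(children: List[Node]) -> Node:
    # Assumption: an internal prototype is the mean of its children's prototypes.
    return Node(prototype=mean([c.prototype for c in children]), children=children)


def best_first(root: Node, query: Vector, k: int) -> List[str]:
    """Expand the highest-scoring frontier node first; stop after k leaves."""
    tie = count()  # tie-breaker so the heap never compares Node objects
    heap = [(-dot(query, root.prototype), next(tie), root)]
    hits: List[str] = []
    while heap and len(hits) < k:
        _, _, node = heapq.heappop(heap)
        if node.doc_id is not None:
            hits.append(node.doc_id)
            continue
        for child in node.children:
            heapq.heappush(heap, (-dot(query, child.prototype), next(tie), child))
    return hits


def path_sum(root: Node, query: Vector) -> List[Tuple[str, float]]:
    """Score each leaf by summing query-prototype scores along its root path."""
    scored: List[Tuple[str, float]] = []
    stack = [(root, 0.0)]
    while stack:
        node, above = stack.pop()
        total = above + dot(query, node.prototype)
        if node.doc_id is not None:
            scored.append((node.doc_id, total))
        stack.extend((child, total) for child in node.children)
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy corpus: two clusters of two documents each, in 2-D.
tree = internal([
    internal([leaf("a", [1.0, 0.0]), leaf("b", [0.9, 0.1])]),
    internal([leaf("c", [0.0, 1.0]), leaf("d", [0.1, 0.9])]),
])
query = [1.0, 0.0]

print(best_first(tree, query, k=2))               # ['a', 'b']
print([doc for doc, _ in path_sum(tree, query)])  # ['a', 'b', 'd', 'c']
```

In this sketch the sequence of nodes popped by best_first (root, then the first cluster, then its leaves) is the retrieval path that serves as a rationale, and path_sum lets coarse prototypes contribute to a leaf's score alongside the leaf itself. The best-first variant is a heuristic here, since a leaf can score higher than its parent prototype and be missed if the parent is never expanded.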