CL · Dec 19, 2024

ClusterChat: Multi-Feature Search for Corpus Exploration

arXiv:2412.14533v2 · 1 citation · h-index: 4 · Has Code · SIGIR
AI Analysis

This paper addresses inefficient corpus exploration in the biomedical, finance, and legal domains, though the contribution appears incremental because it builds on existing methods such as embeddings and search.

The paper tackles the challenge of exploring large-scale text corpora in domains like biomedicine by introducing ClusterChat, an open-source system that integrates cluster-based organization with multi-feature search capabilities. It validates the system on a PubMed dataset of four million abstracts and reports context-aware insights while maintaining scalability.

Exploring large-scale text corpora presents a significant challenge in biomedical, finance, and legal domains, where vast amounts of documents are continuously published. Traditional search methods, such as keyword-based search, often retrieve documents in isolation, limiting the user's ability to easily inspect corpus-wide trends and relationships. We present ClusterChat, an open-source system for corpus exploration that integrates cluster-based organization of documents using textual embeddings with lexical and semantic search, timeline-driven exploration, and corpus and document-level question answering (QA) as multi-feature search capabilities. We validate the system with two case studies on a four million abstract PubMed dataset, demonstrating that ClusterChat enhances corpus exploration by delivering context-aware insights while maintaining scalability and responsiveness on large-scale document collections.

The demo video and source code are available at: https://github.com/achouhan93/ClusterChat
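To make the combination of features in the abstract more concrete, the following is a minimal sketch of how cluster assignment over embeddings can sit alongside lexical and semantic search. It is not ClusterChat's implementation: the function names, the toy documents, the hand-made 2-D vectors standing in for real textual embeddings, and the fixed centroids are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def assign_clusters(embeddings, centroids):
    """Map each centroid index to the ids of the documents closest to it."""
    clusters = defaultdict(list)
    for doc_id, vec in embeddings.items():
        best = max(range(len(centroids)), key=lambda i: cosine(vec, centroids[i]))
        clusters[best].append(doc_id)
    return dict(clusters)


def lexical_search(texts, query, doc_ids):
    """Keep the ids whose text contains every whitespace-separated query term."""
    terms = query.lower().split()
    return [d for d in doc_ids if all(t in texts[d].lower().split() for t in terms)]


def semantic_search(embeddings, query_vec, doc_ids, top_k=2):
    """Rank ids by cosine similarity to the query vector and keep the top_k."""
    ranked = sorted(doc_ids, key=lambda d: cosine(embeddings[d], query_vec), reverse=True)
    return ranked[:top_k]


# Toy data (assumed): hand-made 2-D vectors stand in for real text embeddings.
TEXTS = {
    "d1": "insulin therapy in type 2 diabetes",
    "d2": "diabetes risk and diet",
    "d3": "tumor growth in lung cancer",
    "d4": "cancer immunotherapy response",
}
EMBEDDINGS = {"d1": [0.9, 0.1], "d2": [0.8, 0.3], "d3": [0.1, 0.9], "d4": [0.2, 0.8]}
CENTROIDS = [[1.0, 0.0], [0.0, 1.0]]  # assumed fixed; a real system would learn these

clusters = assign_clusters(EMBEDDINGS, CENTROIDS)
in_cluster_hits = lexical_search(TEXTS, "diabetes", clusters[0])
nearest = semantic_search(EMBEDDINGS, [0.0, 1.0], list(TEXTS), top_k=1)
```

In this sketch the cluster assignment narrows the candidate set first, and the lexical or semantic search then runs over that subset, which is one plausible way to keep queries responsive as a corpus grows.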

Code Implementations: 1 repo