ClusterChat: Multi-Feature Search for Corpus Exploration
This work addresses inefficient corpus exploration for users in the biomedical, finance, and legal domains, though the contribution appears incremental because it builds on existing methods such as embeddings and search.
The paper tackles the challenge of exploring large-scale text corpora in domains like biomedicine by introducing ClusterChat, an open-source system that integrates cluster-based organization with multi-feature search capabilities. It validates the system on a PubMed dataset of four million abstracts, reporting context-aware insights while maintaining scalability.
Exploring large-scale text corpora presents a significant challenge in the biomedical, finance, and legal domains, where vast numbers of documents are continuously published. Traditional search methods, such as keyword-based search, often retrieve documents in isolation, limiting the user's ability to inspect corpus-wide trends and relationships. We present ClusterChat, an open-source system for corpus exploration that integrates cluster-based organization of documents using textual embeddings with multi-feature search capabilities: lexical and semantic search, timeline-driven exploration, and corpus-level and document-level question answering (QA). The demo video and source code are available at https://github.com/achouhan93/ClusterChat. We validate the system with two case studies on a PubMed dataset of four million abstracts, demonstrating that ClusterChat enhances corpus exploration by delivering context-aware insights while maintaining scalability and responsiveness on large-scale document collections.
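To make the combination of cluster-based organization with lexical and semantic search more concrete, the following is a minimal sketch and not ClusterChat's actual implementation. It assumes a toy bag-of-words vector as a stand-in for a learned textual embedding, and it assigns clusters by similarity to hand-written seed topics rather than by a real clustering algorithm. All function names, the seed topics, and the example documents are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a textual embedding: a bag-of-words count vector.
    (Assumption: a real system would use a learned embedding model.)"""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def lexical_search(docs, keyword):
    """Return indices of documents that contain the keyword as a whole token."""
    kw = keyword.lower()
    return [i for i, d in enumerate(docs) if kw in d.lower().split()]


def semantic_search(docs, query, top_k=2):
    """Return indices of the top_k documents ranked by vector similarity."""
    q = embed(query)
    scored = [(cosine(q, embed(d)), i) for i, d in enumerate(docs)]
    scored.sort(key=lambda s: (-s[0], s[1]))  # best score first, stable on ties
    return [i for _, i in scored[:top_k]]


def assign_clusters(docs, seeds):
    """Label each document with the most similar seed topic.
    (Assumption: hypothetical seed topics replace a real clustering step.)"""
    seed_vecs = {name: embed(text) for name, text in seeds.items()}
    return [
        max(seed_vecs, key=lambda name: cosine(embed(d), seed_vecs[name]))
        for d in docs
    ]


# Hypothetical example documents, not taken from the PubMed dataset.
docs = [
    "insulin resistance in type 2 diabetes",
    "diabetes treatment with insulin therapy",
    "tumor growth in lung cancer patients",
    "cancer immunotherapy for lung tumor",
]
seeds = {"diabetes": "diabetes insulin", "cancer": "cancer tumor lung"}

clusters = assign_clusters(docs, seeds)
lexical_hits = lexical_search(docs, "insulin")
semantic_hits = semantic_search(docs, "lung cancer tumor")
```

In this sketch, lexical search matches exact tokens, while the similarity-based search ranks every document against the query vector, and the cluster labels give a corpus-level grouping that a user could browse before drilling into individual documents. With real embeddings, the ranking would also capture paraphrases that share no tokens with the query, which the bag-of-words stand-in cannot do.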