CL · Dec 19, 2024

ClusterChat: Multi-Feature Search for Corpus Exploration

Ashish Chouhan, Saifeldin Mandour, Michael Gertz

arXiv:2412.14533v21.91 citations · h-index: 4 · Has Code · SIGIR

Originality: Incremental advance

AI Analysis

The paper addresses the problem of inefficient corpus exploration for users in the biomedical, finance, and legal domains. The contribution appears incremental, as it builds on existing methods such as textual embeddings and search.

The paper tackles the challenge of exploring large-scale text corpora in domains like biomedicine by introducing ClusterChat, an open-source system that integrates cluster-based organization with multi-feature search capabilities. It validates the system on a PubMed dataset of four million abstracts, reporting context-aware insights while maintaining scalability.

Exploring large-scale text corpora presents a significant challenge in biomedical, finance, and legal domains, where vast numbers of documents are continuously published. Traditional search methods, such as keyword-based search, often retrieve documents in isolation, limiting the user's ability to easily inspect corpus-wide trends and relationships. We present ClusterChat, an open-source system for corpus exploration that integrates cluster-based organization of documents using textual embeddings with lexical and semantic search, timeline-driven exploration, and corpus and document-level question answering (QA) as multi-feature search capabilities. We validate the system with two case studies on a PubMed dataset of four million abstracts, demonstrating that ClusterChat enhances corpus exploration by delivering context-aware insights while maintaining scalability and responsiveness on large-scale document collections. The demo video and source code are available at: https://github.com/achouhan93/ClusterChat
