ML | LG | DATA-AN | ME | Oct 16, 2025

Reliable data clustering with Bayesian community detection

arXiv:2510.15013v1 | 1 citation | h-index: 6
Originality: Incremental advance
AI Analysis

This paper addresses the noise susceptibility of clustering methods used by researchers in fields such as neuroscience, genomics, and ecology, and offers an incremental improvement over existing methods.

The paper tackles unreliable clustering of similarity data by applying Bayesian community detection methods that unite sparsification and clustering with principled model selection. On synthetic data, these methods outperform traditional approaches under high-noise conditions and with fewer samples; on gene co-expression data, the Regularized Map Equation identifies more robust gene modules than WGCNA.

From neuroscience and genomics to systems biology and ecology, researchers rely on clustering similarity data to uncover modular structure. Yet widely used clustering methods, such as hierarchical clustering, k-means, and WGCNA, lack principled model selection, leaving them susceptible to noise. A common workaround sparsifies a correlation matrix representation to remove noise before clustering, but this extra step introduces arbitrary thresholds that can distort the structure and lead to unreliable results. To detect reliable clusters, we capitalize on recent advances in network science to unite sparsification and clustering with principled model selection. We test two Bayesian community detection methods, the Degree-Corrected Stochastic Block Model and the Regularized Map Equation, both grounded in the Minimum Description Length principle for model selection. In synthetic data, they outperform traditional approaches, detecting planted clusters under high-noise conditions and with fewer samples. Compared to WGCNA on gene co-expression data, the Regularized Map Equation identifies more robust and functionally coherent gene modules. Our results establish Bayesian community detection as a principled and noise-resistant framework for uncovering modular structure in high-dimensional data across fields.
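The threshold sensitivity of the sparsify-then-cluster workaround mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's method or code: it assumes toy planted-cluster time series, uses absolute Pearson correlation as the similarity, and takes connected components of the thresholded graph as a stand-in for a clustering step. The helper names (`planted_data`, `count_components`) and all parameter values are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def planted_data(n_clusters=3, per_cluster=5, n_samples=20, noise=1.0, seed=0):
    """Toy data (assumption): each cluster shares one signal plus independent noise."""
    rng = random.Random(seed)
    series = []
    for _ in range(n_clusters):
        signal = [rng.gauss(0, 1) for _ in range(n_samples)]
        for _ in range(per_cluster):
            series.append([s + noise * rng.gauss(0, 1) for s in signal])
    return series


def count_components(series, threshold):
    """Keep edges with |r| >= threshold, then count connected components (union-find)."""
    n = len(series)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if abs(pearson(series[i], series[j])) >= threshold:
                parent[find(i)] = find(j)
    return len({find(i) for i in range(n)})


series = planted_data()
# The same data yields a different number of "clusters" for each cutoff.
counts = {t: count_components(series, t) for t in (0.0, 0.3, 0.5, 0.7, 0.9, 1.01)}
for t, c in counts.items():
    print(f"threshold={t:.2f} -> {c} clusters")
```

At a threshold of 0 every pair is connected and a single cluster results, while a threshold above 1 isolates every node; in between, the count depends entirely on the chosen cutoff, which is the arbitrariness that principled model selection is meant to remove.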
