Chandra Sekhar Mukherjee

LG
h-index: 6
6 papers
13 citations
Novelty: 53%
AI Score: 33

6 Papers

LG · Apr 22, 2022
Capturing the Denoising Effect of PCA via Compression Ratio

Chandra Sekhar Mukherjee, Nikhil Doerkar, Jiapeng Zhang

Principal component analysis (PCA) is one of the most fundamental tools in machine learning, with broad use as a dimensionality reduction and denoising tool. In the latter setting, while PCA is known to be effective at subspace recovery and is proven to aid clustering algorithms in some specific settings, its improvement of noisy data is still not well quantified in general. In this paper, we propose a novel metric called compression ratio to capture the effect of PCA on high-dimensional noisy data. We show that, for data with underlying community structure, PCA significantly reduces the distance between data points belonging to the same community while reducing inter-community distance relatively mildly. We explain this phenomenon through both theoretical proofs and experiments on real-world data. Building on this new metric, we design a straightforward algorithm that can be used to detect outliers. Roughly speaking, we argue that points that have a lower variance of compression ratio do not share a common signal with others (and hence could be considered outliers). We provide theoretical justification for this simple outlier detection algorithm and use simulations to demonstrate that our method is competitive with popular outlier detection tools. Finally, we run experiments on real-world high-dimensional noisy data (single-cell RNA-seq) to show that removing points from these datasets via our outlier detection method improves the accuracy of clustering algorithms. Our method is very competitive with popular outlier detection tools in this task.
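
As a rough illustration of the idea in this abstract, the sketch below assumes that the compression ratio of a pair of points is its pre-PCA distance divided by its post-PCA distance. The abstract does not spell out the definition, so this reading, the synthetic two-community data, and the helper names (`pca_project`, `pairwise_dist`, `compression_ratios`) are all assumptions rather than the authors' implementation.

```python
import numpy as np

def pca_project(X, n_components):
    """Project the rows of X onto the top principal components."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def pairwise_dist(Z):
    """Euclidean distances between all pairs of rows of Z."""
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def compression_ratios(X, n_components):
    """Assumed definition: pre-PCA distance / post-PCA distance, per pair."""
    pre = pairwise_dist(X)
    post = pairwise_dist(pca_project(X, n_components))
    pairs = np.triu_indices(len(X), k=1)  # each unordered pair once
    return pairs, pre[pairs] / post[pairs]

# Synthetic data (an assumption): two communities whose centers differ
# along one coordinate, plus isotropic Gaussian noise in 100 dimensions.
rng = np.random.default_rng(0)
n_per_community, dim, shift = 60, 100, 5.0
labels = np.repeat([0, 1], n_per_community)
centers = np.zeros((2, dim))
centers[0, 0], centers[1, 0] = shift, -shift
X = centers[labels] + rng.normal(size=(2 * n_per_community, dim))

(i, j), ratios = compression_ratios(X, n_components=1)
same_community = labels[i] == labels[j]
intra = np.median(ratios[same_community])
inter = np.median(ratios[~same_community])
print(f"median intra-community ratio: {intra:.2f}")
print(f"median inter-community ratio: {inter:.2f}")
```

Under these assumptions, pairs from the same community are compressed far more than pairs from different communities, which is the qualitative behavior the abstract describes. Medians are used because the ratio is heavy-tailed when the post-PCA distance is close to zero.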

LG · May 19, 2022
Confident Clustering via PCA Compression Ratio and Its Application to Single-cell RNA-seq Analysis

Yingcong Li, Chandra Sekhar Mukherjee, Jiapeng Zhang

Unsupervised clustering algorithms for vectors have been widely used in machine learning. Many applications, including the biological data we study in this paper, contain boundary datapoints that show combined properties of two underlying clusters and can lower the performance of traditional clustering algorithms. We develop a confident clustering method that aims to diminish the influence of these datapoints and improve the clustering results. Concretely, for a list of datapoints, we give two clustering results. The first-round clustering attempts to classify only pure vectors with high confidence. Based on it, we classify more vectors with less confidence in the second round. We validate our algorithm on single-cell RNA-seq data, a powerful and widely used tool in biology. Our confident clustering shows high accuracy on our tested datasets. In addition, unlike traditional clustering methods in single-cell analysis, confident clustering shows high stability under different choices of parameters.

LG · Jul 11, 2025
CoreSPECT: Enhancing Clustering Algorithms via an Interplay of Density and Geometry

Chandra Sekhar Mukherjee, Joonyoung Bae, Jiapeng Zhang

Density and geometry have long served as two of the fundamental guiding principles in clustering algorithm design, with algorithms usually focusing either on the density structure of the data (e.g., HDBSCAN and Density Peak Clustering) or the complexity of the underlying geometry (e.g., manifold clustering algorithms). In this paper, we identify and formalize a recurring but often overlooked interaction between distribution and geometry and leverage this insight to design our clustering enhancement framework CoreSPECT (Core Space Projection-based Enhancement of Clustering Techniques). Our framework boosts the performance of simple algorithms like K-Means and GMM by applying them to strategically selected regions, then extending the partial partition to a complete partition of the dataset using a novel neighborhood-graph-based multi-layer propagation procedure. We apply our framework to 15 datasets from three different domains and obtain consistent and substantial gains in clustering accuracy for both K-Means and GMM. On average, our framework improves the ARI of K-Means by 40% and of GMM by 14%, often surpassing the performance of both manifold-based and recent density-based clustering algorithms. We further support our framework with initial theoretical guarantees, ablations that demonstrate the usefulness of the individual steps, and evidence of robustness to noise.

LG · Jun 6, 2024
A multi-core periphery perspective: Ranking via relative centrality

Chandra Sekhar Mukherjee, Jiapeng Zhang

Community and core-periphery are two widely studied graph structures, with their coexistence observed in real-world graphs (Rombach, Porter, Fowler & Mucha [SIAM J. App. Math. 2014, SIAM Review 2017]). However, the nature of this coexistence is not well understood and has been pointed out as an open problem (Yanchenko & Sengupta [Statistics Surveys, 2023]). In particular, the impact of inferring the core-periphery structure of a graph on understanding its community structure is not well utilized. In this direction, we introduce a novel quantification for graphs with ground truth communities, where each community has a densely connected part (the core), and the rest is more sparse (the periphery), with inter-community edges more frequent between the peripheries. Building on this structure, we propose a new algorithmic concept that we call relative centrality to detect the cores. We observe that core-detection algorithms based on popular centrality measures such as PageRank and degree centrality can show some bias in their outcome by selecting very few vertices from some cores. We show that relative centrality solves this bias issue and provide theoretical and simulation support, as well as experiments on real-world graphs. Core detection is known to have important applications with respect to core-periphery structures. In our model, we show a new application: relative-centrality-based algorithms can select a subset of the vertices such that it contains sufficient vertices from all communities, and points in this subset are better separable into their respective communities. We apply the methods to 11 biological datasets, where they yield a more balanced selection of vertices from all communities, on which clustering algorithms perform better.

LG · Feb 17, 2022
Recovering Unbalanced Communities in the Stochastic Block Model With Application to Clustering with a Faulty Oracle

Chandra Sekhar Mukherjee, Pan Peng, Jiapeng Zhang

The stochastic block model (SBM) is a fundamental model for studying graph clustering or community detection in networks. It has received great attention in the last decade and the balanced case, i.e., assuming all clusters have large size, has been well studied. However, our understanding of SBM with unbalanced communities (arguably, more relevant in practice) is still limited. In this paper, we provide a simple SVD-based algorithm for recovering the communities in the SBM with communities of varying sizes. We improve upon a result of Ailon, Chen and Xu [ICML 2013; JMLR 2015] by removing the assumption that there is a large interval in which no cluster size falls, and by removing the dependency of the size of the recoverable clusters on the number of underlying clusters. We further complement our theoretical improvements with experimental comparisons. Under the planted clique conjecture, the size of the clusters that can be recovered by our algorithm is nearly optimal (up to poly-logarithmic factors) when the probability parameters are constant. As a byproduct, we obtain an efficient clustering algorithm with sublinear query complexity in a faulty oracle model, which is capable of detecting all clusters larger than $\tilde{\Omega}(\sqrt{n})$, even in the presence of $\Omega(n)$ small clusters in the graph. In contrast, previous efficient algorithms that use a sublinear number of queries are incapable of recovering any large clusters if there are more than $\tilde{\Omega}(n^{2/5})$ small clusters.

CC · Jul 23, 2021
On Boolean Functions with Low Polynomial Degree and Higher Order Sensitivity

Subhamoy Maitra, Chandra Sekhar Mukherjee, Pantelimon Stanica et al.

Boolean functions are important primitives in different domains of cryptology, complexity and coding theory. In this paper, we connect the tools from cryptology and complexity theory in the domain of Boolean functions with low polynomial degree and high sensitivity. It is well known that the polynomial degree of a Boolean function and its resiliency are directly connected. Using this connection, we analyze the polynomial degree-sensitivity values through the lens of resiliency, demonstrating existence and non-existence results for functions with low polynomial degree and high sensitivity on a small number of variables (up to 10). In this process, borrowing an idea from complexity theory, we show that one can implement resilient Boolean functions on a large number of variables with linear size and logarithmic depth. Finally, we extend the notion of sensitivity to higher order and note that the existing construction idea of Nisan and Szegedy (1994) can provide only constant higher order sensitivity when aiming for polynomial degree of $n-\omega(1)$. In this direction, we present a construction with low ($n-\omega(1)$) polynomial degree and super-constant $\omega(1)$ order sensitivity, exploiting Maiorana-McFarland constructions that we borrow from the construction of resilient functions. The questions we raise identify novel combinatorial problems in the domain of Boolean functions.
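
To make the two quantities in this abstract concrete, here is a minimal brute-force sketch that computes the (first-order) sensitivity and the real polynomial degree of a small Boolean function. It uses the standard textbook definitions, not the paper's constructions; the function names and the two example functions (not-all-equal and majority on three variables) are chosen here purely for illustration.

```python
from itertools import product

def sensitivity(f, n):
    """Max over inputs x of the number of single-bit flips that change f(x)."""
    best = 0
    for x in product((0, 1), repeat=n):
        flips = 0
        for i in range(n):
            y = x[:i] + (1 - x[i],) + x[i + 1:]
            if f(y) != f(x):
                flips += 1
        best = max(best, flips)
    return best

def polynomial_degree(f, n):
    """Degree of the unique multilinear real polynomial that agrees with f."""
    # Truth table indexed by bitmask: bit i of mask is the value of x_i.
    coeff = [f(tuple((mask >> i) & 1 for i in range(n)))
             for mask in range(1 << n)]
    # In-place Moebius inversion turns values into monomial coefficients.
    for i in range(n):
        for mask in range(1 << n):
            if mask & (1 << i):
                coeff[mask] -= coeff[mask ^ (1 << i)]
    return max((bin(mask).count("1")
                for mask in range(1 << n) if coeff[mask] != 0), default=0)

def not_all_equal(x):
    return int(len(set(x)) > 1)

def majority(x):
    return int(sum(x) > len(x) / 2)

for name, f in (("not_all_equal", not_all_equal), ("majority", majority)):
    print(name, "sensitivity:", sensitivity(f, 3),
          "degree:", polynomial_degree(f, 3))
# not_all_equal sensitivity: 3 degree: 2
# majority sensitivity: 2 degree: 3
```

The not-all-equal function is a small instance of the regime the abstract studies: its sensitivity is the full 3, while its polynomial degree is only 2. The exhaustive search is exponential in the number of variables, so it is only practical for small cases.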