ST Aug 4, 2023
Learning Networks from Gaussian Graphical Models and Gaussian Free Fields
Subhro Ghosh, Soumendu Sundar Mukherjee, Hoang-Son Tran et al.
We investigate the problem of estimating the structure of a weighted network from repeated measurements of a Gaussian Graphical Model (GGM) on the network. In this vein, we consider GGMs whose covariance structures align with the geometry of the weighted network on which they are based. Such GGMs have been of longstanding interest in statistical physics, and are referred to as the Gaussian Free Field (GFF). In recent years, they have attracted considerable interest in machine learning and theoretical computer science. In this work, we propose a novel estimator for the weighted network (equivalently, its Laplacian) from repeated measurements of a GFF on the network, based on the Fourier analytic properties of the Gaussian distribution. In this pursuit, our approach exploits complex-valued statistics constructed from observed data, which are of interest in their own right. We demonstrate the effectiveness of our estimator with concrete recovery guarantees and bounds on the required sample complexity. In particular, we show that the proposed statistic achieves the parametric rate of estimation for fixed network size. In the setting of networks growing with sample size, our results show that for Erdős-Rényi random graphs $G(d,p)$ above the connectivity threshold, network recovery takes place with high probability as soon as the sample size $n$ satisfies $n \gg d^4 \log d \cdot p^{-2}$.
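The abstract treats a weighted network and its Laplacian as equivalent objects. The sketch below only illustrates that correspondence, building $L = D - W$ from a list of weighted edges; it does not reproduce the proposed estimator. The function name and the 3-node example graph are hypothetical.

```python
def weighted_laplacian(num_nodes, weighted_edges):
    """Return the graph Laplacian L = D - W as a list of rows.

    weighted_edges is an iterable of (i, j, w) with i != j and w > 0.
    """
    lap = [[0.0] * num_nodes for _ in range(num_nodes)]
    for i, j, w in weighted_edges:
        # Off-diagonal entries hold the negated edge weights.
        lap[i][j] -= w
        lap[j][i] -= w
        # Diagonal entries accumulate the weighted degrees.
        lap[i][i] += w
        lap[j][j] += w
    return lap


# Hypothetical 3-node path graph with edge weights 2.0 and 0.5.
example_edges = [(0, 1, 2.0), (1, 2, 0.5)]
example_laplacian = weighted_laplacian(3, example_edges)
```

Each edge weight can be read back as the negated off-diagonal entry, which is why recovering the Laplacian is the same task as recovering the weighted network.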
LG May 20
Approximation Theory for Neural Networks: Old and New
Soumendu Sundar Mukherjee, Himasish Talukdar
Universal approximation theorems provide a mathematical explanation for the expressive power of neural networks. They assert that, under mild conditions on the activation function, feedforward neural networks are dense in broad function classes, such as continuous functions on compact subsets of $\mathbb{R}^d$, $L^p$ spaces, or Sobolev spaces. Over the past four decades, these qualitative universality results have evolved into a rich quantitative theory addressing approximation rates, parameter efficiency, and the role of architectural features such as depth and width. This survey presents several glimpses into this theory. We review classical density results for single-hidden-layer networks, as well as quantitative bounds that relate approximation error to network size and smoothness assumptions on target functions. Particular emphasis is placed on depth--width trade-offs and on results demonstrating that deeper architectures can achieve superior parameter efficiency for structured function classes. In addition to standard feedforward neural networks, we also review recent developments on Kolmogorov--Arnold Networks (KANs), which offer an alternative architectural paradigm and whose approximation-theoretic properties have begun to attract significant theoretical attention.
LG Feb 24, 2023
Wasserstein Projection Pursuit of Non-Gaussian Signals
Satyaki Mukherjee, Soumendu Sundar Mukherjee, Debarghya Ghoshdastidar
We consider the general dimensionality reduction problem of locating, in a high-dimensional data cloud, a $k$-dimensional non-Gaussian subspace of interesting features. We use a projection pursuit approach -- we search for mutually orthogonal unit directions which maximise the 2-Wasserstein distance of the empirical distribution of data-projections along these directions from a standard Gaussian. Under a generative model, where there is an underlying (unknown) low-dimensional non-Gaussian subspace, we prove rigorous statistical guarantees on the accuracy of approximating this unknown subspace by the directions found by our projection pursuit approach. Our results operate in the regime where the data dimensionality is comparable to the sample size, and thus supplement the recent literature on the non-feasibility of locating interesting directions via projection pursuit in the complementary regime where the data dimensionality is much larger than the sample size.
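The quantity being maximised is a one-dimensional 2-Wasserstein distance between projected data and a standard Gaussian. A minimal sketch of that index is given below, assuming a midpoint-quantile discretisation (sorted sample values are matched against Gaussian quantiles at levels $(i + 0.5)/n$); the paper's exact definition and the search over directions are not reproduced, and the function name is hypothetical.

```python
from statistics import NormalDist


def w2_to_standard_gaussian(sample):
    """Approximate the 2-Wasserstein distance between the empirical
    distribution of a 1-D sample and the standard Gaussian.

    Assumption: the Gaussian is discretised at midpoint quantile levels.
    """
    n = len(sample)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    squared_cost = 0.0
    for i, x in enumerate(sorted(sample)):
        # In one dimension, optimal transport matches order statistics
        # to quantiles of the target distribution.
        q = std_normal.inv_cdf((i + 0.5) / n)
        squared_cost += (x - q) ** 2
    return (squared_cost / n) ** 0.5


# A sample sitting exactly on the Gaussian quantiles, and a shifted copy.
gaussian_like = [NormalDist().inv_cdf((i + 0.5) / 50) for i in range(50)]
shifted = [x + 2.0 for x in gaussian_like]
```

Under this discretisation the Gaussian-like sample scores zero and the copy shifted by 2 scores 2, so a direction along which the projections look non-Gaussian receives a larger index.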
ST Dec 13, 2023
Minimax-optimal estimation for sparse multi-reference alignment with collision-free signals
Subhro Ghosh, Soumendu Sundar Mukherjee, Jing Bin Pan
The Multi-Reference Alignment (MRA) problem aims at the recovery of an unknown signal from repeated observations under the latent action of a group of cyclic isometries, in the presence of additive noise of high intensity $\sigma$. It is a more tractable version of the celebrated cryo-EM model. In the crucial high noise regime, it is known that its sample complexity scales as $\sigma^6$. Recent investigations have shown that for the practically significant setting of sparse signals, the sample complexity of the maximum likelihood estimator asymptotically scales with the noise level as $\sigma^4$. In this work, we investigate minimax optimality for signal estimation under the MRA model for so-called collision-free signals. In particular, this signal class covers the setting of generic signals of dilute sparsity (wherein the support size $s=O(L^{1/3})$, where $L$ is the ambient dimension). We demonstrate that the minimax optimal rate of estimation for the sparse MRA problem in this setting is $\sigma^2/\sqrt{n}$, where $n$ is the sample size. In particular, this widely generalizes the sample complexity asymptotics for the restricted MLE in this setting, establishing it as the statistically optimal estimator. Finally, we demonstrate a concentration inequality for the restricted MLE on its deviations from the ground truth.
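The observation model described above can be sketched as follows, assuming the common formulation in which each observation is a uniformly random cyclic shift of the signal plus independent Gaussian noise of standard deviation $\sigma$. The function names are hypothetical, and no estimator from the paper is implemented here.

```python
import random


def cyclic_shift(signal, shift):
    """Shift a signal cyclically to the right by `shift` positions."""
    n = len(signal)
    return [signal[(k - shift) % n] for k in range(n)]


def mra_observation(signal, sigma, rng):
    """Draw one observation: a uniformly random cyclic shift (assumed)
    of the signal plus i.i.d. N(0, sigma^2) noise on every coordinate."""
    shift = rng.randrange(len(signal))
    shifted = cyclic_shift(signal, shift)
    return [value + rng.gauss(0.0, sigma) for value in shifted]


# Hypothetical sparse signal in ambient dimension L = 8.
rng = random.Random(0)
sparse_signal = [0.0, 3.0, 0.0, 0.0, 1.5, 0.0, 0.0, 0.0]
noisy_samples = [mra_observation(sparse_signal, sigma=2.0, rng=rng) for _ in range(5)]
```

Because the shift is latent, the signal can only be recovered up to a cyclic shift, which is why sample complexity is stated in terms of the noise level.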
LG Feb 28, 2025
Optimal Transfer Learning for Missing Not-at-Random Matrix Completion
Akhil Jalan, Yassir Jedra, Arya Mazumdar et al.
We study transfer learning for matrix completion in a Missing Not-at-Random (MNAR) setting that is motivated by biological problems. The target matrix $Q$ has entire rows and columns missing, making estimation impossible without side information. To address this, we use a noisy and incomplete source matrix $P$, which relates to $Q$ via a feature shift in latent space. We consider both the active and passive sampling of rows and columns. We establish minimax lower bounds for entrywise estimation error in each setting. Our computationally efficient estimation framework achieves this lower bound for the active setting, which leverages the source data to query the most informative rows and columns of $Q$. This avoids the need for incoherence assumptions required for rate optimality in the passive sampling setting. We demonstrate the effectiveness of our approach through comparisons with existing algorithms on real-world biological datasets.
LG Feb 27, 2024
Implicit Regularization via Spectral Neural Networks and Non-linear Matrix Sensing
Hong T. M. Chu, Subhro Ghosh, Chi Thanh Lam et al.
The phenomenon of implicit regularization has attracted interest in recent years as a fundamental aspect of the remarkable generalizing ability of neural networks. In a nutshell, it entails that gradient descent dynamics in many neural nets, even without any explicit regularizer in the loss function, converges to the solution of a regularized learning problem. However, known results attempting to theoretically explain this phenomenon focus overwhelmingly on the setting of linear neural nets, and the simplicity of the linear structure is particularly crucial to existing arguments. In this paper, we explore this problem in the context of more realistic neural networks with a general class of non-linear activation functions, and rigorously demonstrate the implicit regularization phenomenon for such networks in the setting of matrix sensing problems, together with rigorous rate guarantees that ensure exponentially fast convergence of gradient descent. In this vein, we contribute a network architecture called Spectral Neural Networks (abbreviated SNN) that is particularly suitable for matrix learning problems. Conceptually, this entails coordinatizing the space of matrices by their singular values and singular vectors, as opposed to by their entries, a potentially fruitful perspective for matrix learning. We demonstrate that the SNN architecture is inherently much more amenable to theoretical analysis than vanilla neural nets and confirm its effectiveness in the context of matrix sensing, via both mathematical guarantees and empirical investigations. We believe that the SNN architecture has the potential to be of wide applicability in a broad class of matrix learning scenarios.
ML Jul 20, 2025
Learning under Latent Group Sparsity via Diffusion on Networks
Subhroshekhar Ghosh, Soumendu Sundar Mukherjee
Group or cluster structure on explanatory variables in machine learning problems is a very general phenomenon, which has attracted broad interest from practitioners and theoreticians alike. In this work we contribute an approach to sparse learning under such group structure, that does not require prior information on the group identities. Our paradigm is motivated by the Laplacian geometry of an underlying network with a related community structure, and proceeds by directly incorporating this into a penalty that is effectively computed via a heat-flow-based local network dynamics. The proposed penalty interpolates between the lasso and the group lasso penalties, the runtime of the heat-flow dynamics being the interpolating parameter. As such it can automatically default to lasso when the group structure reflected in the Laplacian is weak. In fact, we demonstrate a data-driven procedure to construct such a network based on the available data. Notably, we dispense with computationally intensive pre-processing involving clustering of variables, spectral or otherwise. Our technique is underpinned by rigorous theorems that guarantee its effective performance and provide bounds on its sample complexity. In particular, in a wide range of settings, it provably suffices to run the diffusion for time that is only logarithmic in the problem dimensions. We explore in detail the interfaces of our approach with key statistical physics models in network science, such as the Gaussian Free Field and the Stochastic Block Model. Our work raises the possibility of applying similar diffusion-based techniques to classical learning tasks, exploiting the interplay between geometric, dynamical and stochastic structures underlying the data.
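The diffusion ingredient mentioned above can be sketched as follows: an explicit Euler discretisation of the heat equation $\dot{x} = -Lx$ on a graph with Laplacian $L$. This illustrates only the heat flow itself; how the penalty is built from it is not reproduced. The function name, step size and two-triangle example graph are assumptions made for illustration.

```python
def heat_flow(laplacian, x0, total_time, dt=0.01):
    """Approximate x(t) = exp(-t L) x0 with explicit Euler steps of dx/dt = -L x.

    Assumption: dt is small enough for stability (dt * largest eigenvalue < 2).
    """
    x = list(x0)
    n = len(x)
    steps = int(round(total_time / dt))
    for _ in range(steps):
        lap_x = [sum(laplacian[i][j] * x[j] for j in range(n)) for i in range(n)]
        x = [x[i] - dt * lap_x[i] for i in range(n)]
    return x


# Hypothetical network: two disconnected triangles, i.e. two clear groups.
triangle = [[2.0, -1.0, -1.0], [-1.0, 2.0, -1.0], [-1.0, -1.0, 2.0]]
two_groups = [[0.0] * 6 for _ in range(6)]
for block in (0, 3):
    for i in range(3):
        for j in range(3):
            two_groups[block + i][block + j] = triangle[i][j]

start = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
at_time_zero = heat_flow(two_groups, start, total_time=0.0)
at_long_time = heat_flow(two_groups, start, total_time=10.0)
```

At time zero the mass stays on a single coordinate, while after a long run it is spread evenly over that coordinate's group and never leaks into the other group. This matches the abstract's picture of the diffusion runtime moving between coordinate-wise and group-wise behaviour.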
ML Dec 27, 2024
A new approach to locally adaptive polynomial regression
Sabyasachi Chatterjee, Subhajit Goswami, Soumendu Sundar Mukherjee
Adaptive bandwidth selection is a fundamental challenge in nonparametric regression. This paper introduces a new bandwidth selection procedure inspired by the optimality criteria for $\ell_0$-penalized regression. Although similar in spirit to Lepski's method and its variants in selecting the largest interval satisfying an admissibility criterion, our approach stems from a distinct philosophy, utilizing criteria based on $\ell_2$-norms of interval projections rather than explicit point and variance estimates. We obtain non-asymptotic risk bounds for the local polynomial regression methods based on our bandwidth selection procedure which adapt (near-)optimally to the local Hölder exponent of the underlying regression function simultaneously at all points in its domain. Furthermore, we show that there is a single ideal choice of a global tuning parameter in each case under which the above-mentioned local adaptivity holds. The optimal risks of our methods derive from the properties of solutions to a new ``bandwidth selection equation'' which is of independent interest. We believe that the principles underlying our approach provide a new perspective to the classical yet ever relevant problem of locally adaptive nonparametric regression.
LG Jun 5, 2024
Transfer Learning for Latent Variable Network Models
Akhil Jalan, Arya Mazumdar, Soumendu Sundar Mukherjee et al.
We study transfer learning for estimation in latent variable network models. In our setting, the conditional edge probability matrices given the latent variables are represented by $P$ for the source and $Q$ for the target. We wish to estimate $Q$ given two kinds of data: (1) edge data from a subgraph induced by an $o(1)$ fraction of the nodes of $Q$, and (2) edge data from all of $P$. If the source $P$ has no relation to the target $Q$, the estimation error must be $\Omega(1)$. However, we show that if the latent variables are shared, then vanishing error is possible. We give an efficient algorithm that utilizes the ordering of a suitably defined graph distance. Our algorithm achieves $o(1)$ error and does not assume a parametric form on the source or target networks. Next, for the specific case of Stochastic Block Models we prove a minimax lower bound and show that a simple algorithm achieves this rate. Finally, we empirically demonstrate our algorithm's use on real-world and simulated graph transfer problems.
ME Jan 20, 2022
Learning with latent group sparsity via heat flow dynamics on networks
Subhroshekhar Ghosh, Soumendu Sundar Mukherjee
Group or cluster structure on explanatory variables in machine learning problems is a very general phenomenon, which has attracted broad interest from practitioners and theoreticians alike. In this work we contribute an approach to learning under such group structure, that does not require prior information on the group identities. Our paradigm is motivated by the Laplacian geometry of an underlying network with a related community structure, and proceeds by directly incorporating this into a penalty that is effectively computed via a heat flow-based local network dynamics. In fact, we demonstrate a procedure to construct such a network based on the available data. Notably, we dispense with computationally intensive pre-processing involving clustering of variables, spectral or otherwise. Our technique is underpinned by rigorous theorems that guarantee its effective performance and provide bounds on its sample complexity. In particular, in a wide range of settings, it provably suffices to run the heat flow dynamics for time that is only logarithmic in the problem dimensions. We explore in detail the interfaces of our approach with key statistical physics models in network science, such as the Gaussian Free Field and the Stochastic Block Model. We validate our approach by successful applications to real-world data from a wide array of application domains, including computer science, genetics, climatology and economics. Our work raises the possibility of applying similar diffusion-based techniques to classical learning tasks, exploiting the interplay between geometric, dynamical and stochastic structures underlying the data.
CL Nov 29, 2021
Changepoint Analysis of Topic Proportions in Temporal Text Data
Avinandan Bose, Soumendu Sundar Mukherjee
Changepoint analysis deals with unsupervised detection and/or estimation of time-points in time-series data at which the distribution generating the data changes. In this article, we consider \emph{offline} changepoint detection in the context of large scale textual data. We build a specialised temporal topic model with provisions for changepoints in the distribution of topic proportions. As full likelihood based inference in this model is computationally intractable, we develop a computationally tractable approximate inference procedure. More specifically, we use sample splitting to estimate topic polytopes first and then apply a likelihood ratio statistic together with a modified version of the wild binary segmentation algorithm of Fryzlewicz et al. (2014). Our methodology facilitates automated detection of structural changes in large corpora without the need for manual processing by domain experts. As changepoints under our model correspond to changes in topic structure, the estimated changepoints are often highly interpretable as marking the surge or decline in popularity of a fashionable topic. We apply our procedure to two large datasets: (i) a corpus of English literature from the period 1800-1922 (Underwood et al., 2015); (ii) abstracts from the High Energy Physics arXiv repository (Clement et al., 2019). We obtain some historically well-known changepoints and discover some new ones.
ME Aug 20, 2020
Exact Tests for Offline Changepoint Detection in Multichannel Binary and Count Data with Application to Networks
Shyamal K. De, Soumendu Sundar Mukherjee
We consider offline detection of a single changepoint in binary and count time-series. We compare exact tests based on the cumulative sum (CUSUM) and the likelihood ratio (LR) statistics, and a new proposal that combines exact two-sample conditional tests with multiplicity correction, against standard asymptotic tests based on the Brownian bridge approximation to the CUSUM statistic. We see empirically that the exact tests are much more powerful in situations where normal approximations driving asymptotic tests are not trustworthy: (i) small sample settings; (ii) sparse parametric settings; (iii) time-series with changepoint near the boundary. We also consider a multichannel version of the problem, where channels can have different changepoints. Controlling the False Discovery Rate (FDR), we simultaneously detect changes in multiple channels. This "local" approach is shown to be more advantageous than multivariate global testing approaches when the number of channels with changepoints is much smaller than the total number of channels. As a natural application, we consider network-valued time-series and use our approach with (a) edges as binary channels and (b) node-degrees or other local subgraph statistics as count channels. The local testing approach is seen to be much more informative than global network changepoint algorithms.
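For readers unfamiliar with the CUSUM statistic mentioned above, a minimal sketch of the plain scan for a single changepoint in a binary series follows. It computes the weighted difference of means on either side of each candidate split and returns the maximiser. The exact tests, the variance normalisation and the multiplicity correction from the paper are not implemented; the function name and the example series are hypothetical.

```python
def cusum_changepoint(series):
    """Return (t, statistic) maximising the CUSUM contrast
    sqrt(t (n - t) / n) * |mean(series[:t]) - mean(series[t:])| over 1 <= t < n.

    Assumption: no variance normalisation is applied.
    """
    n = len(series)
    total = sum(series)
    best_t, best_stat = None, -1.0
    prefix = 0.0
    for t in range(1, n):
        prefix += series[t - 1]
        mean_left = prefix / t
        mean_right = (total - prefix) / (n - t)
        stat = (t * (n - t) / n) ** 0.5 * abs(mean_left - mean_right)
        if stat > best_stat:
            best_t, best_stat = t, stat
    return best_t, best_stat


# Hypothetical binary channel whose success probability jumps at time 10.
binary_channel = [0] * 10 + [1] * 10
estimated_time, estimated_stat = cusum_changepoint(binary_channel)
```

In the multichannel setting described above, a scan of this kind would be run per channel (for example, per edge of a network) before any false discovery rate control is applied.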
ML Jun 2, 2019
Graphon Estimation from Partially Observed Network Data
Soumendu Sundar Mukherjee, Sayak Chakrabarti
We consider estimating the edge-probability matrix of a network generated from a graphon model when the full network is not observed---only some overlapping subgraphs are. We extend the neighbourhood smoothing (NBS) algorithm of Zhang et al. (2017) to this missing-data set-up and show experimentally that, for a wide range of graphons, the extended NBS algorithm achieves significantly smaller error rates than standard graphon estimation algorithms such as vanilla neighbourhood smoothing (NBS), universal singular value thresholding (USVT), blockmodel approximation, matrix completion, etc. We also show that the extended NBS algorithm is much more robust to missing data.
LG Jan 1, 2019
Morphological Network: How Far Can We Go with Morphological Neurons?
Ranjan Mondal, Sanchayan Santra, Soumendu Sundar Mukherjee et al.
Morphological neurons, that is, morphological operators such as dilation and erosion with learnable structuring elements, have intrigued researchers for quite some time because of the power these operators bring to the table despite their simplicity. These operators are known to be powerful nonlinear tools, but for a given problem, coming up with a sequence of operations and their structuring elements is a non-trivial task. So, the existing works have mainly focused on this part of the problem without delving deep into their applicability as generic operators. A few works have tried to utilize morphological neurons as a part of classification (and regression) networks when the input is a feature vector. However, these methods mainly focus on a specific problem, without going into generic theoretical analysis. In this work, we have theoretically analyzed morphological neurons and have shown that these are far more powerful than previously anticipated. Our proposed morphological block, containing dilation and erosion followed by their linear combination, represents a sum of hinge functions. Existing works show that hinge functions perform quite well in classification and regression problems. Two morphological blocks can even approximate any continuous function. However, to facilitate the theoretical analysis that we have done in this paper, we have restricted ourselves to the 1D version of the operators, where the structuring element operates on the whole input. Experimental evaluations also indicate the effectiveness of networks built with morphological neurons, over similarly structured neural networks.
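A minimal sketch of a morphological block in the 1D setting described above, assuming the usual grey-scale definitions in which the structuring element spans the whole input: dilation is $\max_k (x_k + s_k)$ and erosion is $\min_k (x_k - s_k)$. The exact parametrisation used in the paper may differ; all names and the example values are hypothetical.

```python
def dilation(x, s):
    """Dilation neuron: max over coordinates of input plus structuring element."""
    return max(xi + si for xi, si in zip(x, s))


def erosion(x, s):
    """Erosion neuron: min over coordinates of input minus structuring element."""
    return min(xi - si for xi, si in zip(x, s))


def morphological_block(x, dil_elements, ero_elements, dil_weights, ero_weights, bias=0.0):
    """Linear combination of several dilation and erosion neurons."""
    out = bias
    for s, w in zip(dil_elements, dil_weights):
        out += w * dilation(x, s)
    for s, w in zip(ero_elements, ero_weights):
        out += w * erosion(x, s)
    return out


# With zero structuring elements, dilation is the max and erosion the min,
# so this block returns the range (max minus min) of the input vector.
features = [1.0, 2.0, 3.0]
zero_element = [0.0, 0.0, 0.0]
block_output = morphological_block(
    features, [zero_element], [zero_element], dil_weights=[1.0], ero_weights=[-1.0]
)
```

Each neuron is a max or min of affine functions of the input, so the block is piecewise linear, which is consistent with the abstract's description of it as a sum of hinge functions.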
ML Aug 18, 2017
Two provably consistent divide and conquer clustering algorithms for large networks
Soumendu Sundar Mukherjee, Purnamrita Sarkar, Peter J. Bickel
In this article, we advance divide-and-conquer strategies for solving the community detection problem in networks. We propose two algorithms which perform clustering on a number of small subgraphs and finally patch the results into a single clustering. The main advantage of these algorithms is that they significantly bring down the computational cost of traditional algorithms, including spectral clustering, semi-definite programs, modularity based methods, likelihood based methods etc., without losing accuracy and even improving accuracy at times. These algorithms are also, by nature, parallelizable. Thus, exploiting the facts that most traditional algorithms are accurate and the corresponding optimization problems are much simpler in small problems, our divide-and-conquer methods provide an omnibus recipe for scaling traditional algorithms up to large networks. We prove consistency of these algorithms under various subgraph selection procedures and perform extensive simulations and real-data analysis to understand the advantages of the divide-and-conquer approach in various settings.
ML Jun 8, 2016
On clustering network-valued data
Soumendu Sundar Mukherjee, Purnamrita Sarkar, Lizhen Lin
Community detection, which focuses on clustering nodes or detecting communities in (mostly) a single network, is a problem of considerable practical interest and has received a great deal of attention in the research community. While being able to cluster within a network is important, there are emerging needs to be able to cluster multiple networks. This is largely motivated by the routine collection of network data that are generated from potentially different populations. These networks may or may not have node correspondence. When node correspondence is present, we cluster networks by summarizing each network by its graphon estimate, whereas when node correspondence is not present, we propose a novel solution for clustering such networks by associating a computationally feasible feature vector to each network based on traces of powers of the adjacency matrix. We illustrate our methods using both simulated and real data sets, and theoretical justifications are provided in terms of consistency.
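The feature vector for networks without node correspondence is described as being based on the trace of powers of the adjacency matrix. A minimal sketch of such raw features follows, assuming no normalisation (any scaling used in the paper is not reproduced here); the function names are hypothetical.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def trace_power_features(adjacency, max_power):
    """Return [trace(A^1), trace(A^2), ..., trace(A^max_power)].

    These traces count closed walks, so they are unchanged by relabelling nodes.
    """
    n = len(adjacency)
    power = [row[:] for row in adjacency]
    features = []
    for _ in range(max_power):
        features.append(sum(power[i][i] for i in range(n)))
        power = mat_mul(power, adjacency)
    return features


# Triangle graph: trace(A^2) is twice the edge count and
# trace(A^3) is six times the triangle count.
triangle_graph = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
triangle_features = trace_power_features(triangle_graph, 3)
```

Because the traces do not depend on how nodes are labelled, networks on different node sets can be compared through these vectors and then grouped with any standard clustering routine.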