Sayan Mukherjee

h-index4

28papers

259citations

Novelty46%

AI Score48

Ranked #53,651 of 201,326 authors (top 27%)#558 in ML (top 16%)

28 Papers

22.9DSMay 12

Congestion bounds via Laplacian eigenvalues and their application to tensor networks with arbitrary geometry

Sayan Mukherjee, Shinichiro Akiyama

Embedding the vertices of arbitrary graphs into trees while minimizing some measure of overlap is an important problem with applications in computer science and physics. In this work, we consider the problem of bijectively embedding the vertices of an $n$-vertex graph $G$ into the \textit{leaves} of an $n$-leaf \textit{rooted binary tree} $\mathcal{T}$. The congestion of such an embedding is given by the largest size of the cut induced by the two components obtained by deleting any vertex of $\mathcal{T}$. We show that for any embedding, the congestion lies between $λ_2(G)\cdot 2n/9$ and $λ_n(G)\cdot n/4$, letting $0=λ_1(G)\le \cdots \le λ_n(G)$ be the Laplacian eigenvalues of $G$, and there is an embedding for which the congestion is at most $λ_n(G)\cdot 2n/9$. Beyond these general bounds, we determine the congestion exactly for hypercubes and lattice graphs, and obtain asymptotically tight bounds for random regular graphs and Erdős-Rényi graphs. We further introduce an efficient contraction procedure based on spectral ordering and dynamic programming, which produces low-congestion embeddings in practice. Numerical experiments on structured graphs, random graphs, and tensor network representations of quantum circuits validate our theoretical bounds and demonstrate the effectiveness of the proposed method. These results yield new spectral bounds on the memory and time complexity of exact tensor network contraction in terms of the underlying graph structure.

MLMar 12, 2023

Global Optimality of Elman-type RNN in the Mean-Field Regime

Andrea Agazzi, Jianfeng Lu, Sayan Mukherjee

We analyze Elman-type Recurrent Reural Networks (RNNs) and their training in the mean-field regime. Specifically, we show convergence of gradient descent training dynamics of the RNN to the corresponding mean-field formulation in the large width limit. We also show that the fixed points of the limiting infinite-width dynamics are globally optimal, under some assumptions on the initialization of the weights. Our results establish optimality for feature-learning with wide RNNs in the mean-field regime

MLJun 6, 2023

Asymptotics of Bayesian Uncertainty Estimation in Random Features Regression

Youngsoo Baek, Samuel I. Berchuck, Sayan Mukherjee

In this paper we compare and contrast the behavior of the posterior predictive distribution to the risk of the maximum a posteriori estimator for the random features regression model in the overparameterized regime. We will focus on the variance of the posterior predictive distribution (Bayesian model average) and compare its asymptotics to that of the risk of the MAP estimator. In the regime where the model dimensions grow faster than any constant multiple of the number of samples, asymptotic agreement between these two quantities is governed by the phase transition in the signal-to-noise ratio. They also asymptotically agree with each other when the number of samples grow faster than any constant multiple of model dimensions. Numerical simulations illustrate finer distributional properties of the two quantities for finite dimensions. We conjecture they have Gaussian fluctuations and exhibit similar properties as found by previous authors in a Gaussian sequence model, which is of independent theoretical interest.

LGJun 22, 2022

Concentration inequalities and optimal number of layers for stochastic deep neural networks

Michele Caprio, Sayan Mukherjee

We state concentration inequalities for the output of the hidden layers of a stochastic deep neural network (SDNN), as well as for the output of the whole SDNN. These results allow us to introduce an expected classifier (EC), and to give probabilistic upper bound for the classification error of the EC. We also state the optimal number of layers for the SDNN via an optimal stopping procedure. We apply our analysis to a stochastic version of a feedforward neural network with ReLU activation function.

LGFeb 22

Robust Predictive Uncertainty and Double Descent in Contaminated Bayesian Random Features

Michele Caprio, Katerina Papagiannouli, Siu Lun Chau et al. · oxford

We propose a robust Bayesian formulation of random feature (RF) regression that accounts explicitly for prior and likelihood misspecification via Huber-style contamination sets. Starting from the classical equivalence between ridge-regularized RF training and Bayesian inference with Gaussian priors and likelihoods, we replace the single prior and likelihood with $ε$- and $η$-contaminated credal sets, respectively, and perform inference using pessimistic generalized Bayesian updating. We derive explicit and tractable bounds for the resulting lower and upper posterior predictive densities. These bounds show that, when contamination is moderate, prior and likelihood ambiguity effectively acts as a direct contamination of the posterior predictive distribution, yielding uncertainty envelopes around the classical Gaussian predictive. We introduce an Imprecise Highest Density Region (IHDR) for robust predictive uncertainty quantification and show that it admits an efficient outer approximation via an adjusted Gaussian credible interval. We further obtain predictive variance bounds (under a mild truncation approximation for the upper bound) and prove that they preserve the leading-order proportional-growth asymptotics known for RF models. Together, these results establish a robustness theory for Bayesian random features: predictive uncertainty remains computationally tractable, inherits the classical double-descent phase structure, and is improved by explicit worst-case guarantees under bounded prior and likelihood misspecification.

9.5AIMay 14

Synthesizing POMDP Policies: Sampling Meets Model-checking via Learning

Debraj Chakraborty, Anirban Majumdar, Prince Mathew et al.

Partially Observable Markov Decision Processes (POMDPs) are the standard framework for decision-making under uncertainty. While sampling-based methods scale well, they lack formal correctness guarantees, making them unsuitable for safety-critical applications. Conversely, formal synthesis techniques provide correctness-by-construction but often struggle with scalability, as general POMDP synthesis is undecidable. To bridge this gap, we propose a synthesis framework that integrates sampling, automata learning, and model-checking. Inspired by Angluin's $L^*$ algorithm, our approach utilizes sampling as a membership oracle and model-checking as an equivalence oracle. This enables the synthesis of finite-state controllers with formal guarantees, provided the sampling-induced policy is regular. We establish a relative completeness result for this framework. Experimental results from our prototypical implementation demonstrate that this method successfully solves threshold-safety problems that remain challenging for existing formal synthesis tools. We believe our algorithm serves as a valuable component in a portfolio approach to tackling the inherent difficulty of POMDP synthesis problems.

APAug 24, 2019Code

Scalable Modeling of Spatiotemporal Data using the Variational Autoencoder: an Application in Glaucoma

Samuel I. Berchuck, Felipe A. Medeiros, Sayan Mukherjee

As big spatial data becomes increasingly prevalent, classical spatiotemporal (ST) methods often do not scale well. While methods have been developed to account for high-dimensional spatial objects, the setting where there are exceedingly large samples of spatial observations has had less attention. The variational autoencoder (VAE), an unsupervised generative model based on deep learning and approximate Bayesian inference, fills this void using a latent variable specification that is inferred jointly across the large number of samples. In this manuscript, we compare the performance of the VAE with a more classical ST method when analyzing longitudinal visual fields from a large cohort of patients in a prospective glaucoma study. Through simulation and a case study, we demonstrate that the VAE is a scalable method for analyzing ST data, when the goal is to obtain accurate predictions. R code to implement the VAE can be found on GitHub: https://github.com/berchuck/vaeST.

MLAug 3, 2024

Meta-Posterior Consistency for the Bayesian Inference of Metastable System

Zachary P Adams, Sayan Mukherjee

The vast majority of the literature on learning dynamical systems or stochastic processes from time series has focused on stable or ergodic systems, for both Bayesian and frequentist inference procedures. However, most real-world systems are only metastable, that is, the dynamics appear to be stable on some time scale, but are in fact unstable over longer time scales. Consistency of inference for metastable systems may not be possible, but one can ask about metaconsistency: Do inference procedures converge when observations are taken over a large but finite time interval, but diverge on longer time scales? In this paper we introduce, discuss, and quantify metaconsistency in a Bayesian framework. We discuss how metaconsistency can be exploited to efficiently infer a model for a sub-system of a larger system, where inference on the global behavior may require much more data, or there is no theoretical guarantee as to the asymptotic success of inference procedures. We also discuss the relation between metaconsistency and the spectral properties of the model dynamical system in the case of uniformly ergodic and non-ergodic diffusions.

AIDec 2, 2024

Approximately Optimal Search on a Higher-dimensional Sliding Puzzle

Nono SC Merleau, Miguel O'Malley, Érika Roldán et al.

Higher-dimensional sliding puzzles are constructed on the vertices of a $d$-dimensional hypercube, where $2^d-l$ vertices are distinctly coloured. Rings with the same colours are initially set randomly on the vertices of the hypercube. The goal of the puzzle is to move each of the $2^d-l$ rings to pre-defined target vertices on the cube. In this setting, the $k$-rule constraint represents a generalisation of edge collision for the movement of colours between vertices, allowing movement only when a hypercube face of dimension $k$ containing a ring is completely free of other rings. Starting from an initial configuration, what is the minimum number of moves needed to make ring colours match the vertex colours? An algorithm that provides us with such a number is called God's algorithm. When such an algorithm exists, it does not have a polynomial time complexity, at least in the case of the 15-puzzle corresponding to $k=1$ in the cubical puzzle. This paper presents a comprehensive computational study of different scenarios of the higher-dimensional puzzle. A benchmark of three computational techniques, an exact algorithm (the A* search) and two approximately optimal search techniques (an evolutionary algorithm (EA) and reinforcement learning (RL)) is presented in this work. The experiments show that all three methods can successfully solve the puzzle of dimension three for different face dimensions and across various difficulty levels. When the dimension increases, the A* search fails, and RL and EA methods can still provide a generally acceptable solution, i.e. a distribution of a number of moves with a median value of less than $30$. Overall, the EA method consistently requires less computational time, while failing in most cases to minimise the number of moves for the puzzle dimensions $d=4$ and $d=5$.

AIMay 7, 2024

Pragmatist Intelligence: Where the Principle of Usefulness Can Take ANNs

Antonio Bikić, Sayan Mukherjee

Artificial neural networks (ANNs) perform extraordinarily on numerous tasks including classification or prediction, e.g., speech processing and image classification. These new functions are based on a computational model that is enabled to select freely all necessary internal model parameters as long as it eventually delivers the functionality it is supposed to exhibit. Here, we review the connection between the model parameter selection in machine learning (ML) algorithms running on ANNs and the epistemological theory of neopragmatism focusing on the theory's utility and anti-representationalist aspects. To understand the consequences of the model parameter selection of an ANN, we suggest using neopragmatist theories whose implications are well studied. Incidentally, neopragmatism's notion of optimization is also based on utility considerations. This means that applying this approach elegantly reveals the inherent connections between optimization in ML, using a numerical method during the learning phase, and optimization in the ethical theory of consequentialism, where it occurs as a maxim of action. We suggest that these connections originate from the way relevance is calculated in ML systems. This could ultimately reveal a tendency for specific actions in ML systems.

LGDec 15, 2021

Tight query complexity bounds for learning graph partitions

Xizhi Liu, Sayan Mukherjee

Given a partition of a graph into connected components, the membership oracle asserts whether any two vertices of the graph lie in the same component or not. We prove that for $n\ge k\ge 2$, learning the components of an $n$-vertex hidden graph with $k$ components requires at least $(k-1)n-\binom k2$ membership queries. Our result improves on the best known information-theoretic bound of $Ω(n\log k)$ queries, and exactly matches the query complexity of the algorithm introduced by [Reyzin and Srivastava, 2007] for this problem. Additionally, we introduce an oracle, with access to which one can learn the number of components of $G$ in asymptotically fewer queries than learning the full partition, thus answering another question posed by the same authors. Lastly, we introduce a more applicable version of this oracle, and prove asymptotically tight bounds of $\widetildeΘ(m)$ queries for both learning and verifying an $m$-edge hidden graph $G$ using it.

ARAug 2, 2021

Accelerating Markov Random Field Inference with Uncertainty Quantification

Ramin Bashizade, Xiangyu Zhang, Sayan Mukherjee et al.

Statistical machine learning has widespread application in various domains. These methods include probabilistic algorithms, such as Markov Chain Monte-Carlo (MCMC), which rely on generating random numbers from probability distributions. These algorithms are computationally expensive on conventional processors, yet their statistical properties, namely interpretability and uncertainty quantification (UQ) compared to deep learning, make them an attractive alternative approach. Therefore, hardware specialization can be adopted to address the shortcomings of conventional processors in running these applications. In this paper, we propose a high-throughput accelerator for Markov Random Field (MRF) inference, a powerful model for representing a wide range of applications, using MCMC with Gibbs sampling. We propose a tiled architecture which takes advantage of near-memory computing, and memory optimizations tailored to the semantics of MRF. Additionally, we propose a novel hybrid on-chip/off-chip memory system and logging scheme to efficiently support UQ. This memory system design is not specific to MRF models and is applicable to applications using probabilistic algorithms. In addition, it dramatically reduces off-chip memory bandwidth requirements. We implemented an FPGA prototype of our proposed architecture using high-level synthesis tools and achieved 146MHz frequency for an accelerator with 32 function units on an Intel Arria 10 FPGA. Compared to prior work on FPGA, our accelerator achieves 26X speedup. Furthermore, our proposed memory system and logging scheme to support UQ reduces off-chip bandwidth by 71% for two applications. ASIC analysis in 15nm shows our design with 2048 function units running at 3GHz outperforms GPU implementations of motion estimation and stereo vision on Nvidia RTX2080Ti by 120X-210X, occupying only 7.7% of the area.

SDMay 31, 2021

A Methodology for Exploring Deep Convolutional Features in Relation to Hand-Crafted Features with an Application to Music Audio Modeling

Anna K. Yanchenko, Mohammadreza Soltani, Robert J. Ravier et al.

Understanding the features learned by deep models is important from a model trust perspective, especially as deep systems are deployed in the real world. Most recent approaches for deep feature understanding or model explanation focus on highlighting input data features that are relevant for classification decisions. In this work, we instead take the perspective of relating deep features to well-studied, hand-crafted features that are meaningful for the application of interest. We propose a methodology and set of systematic experiments for exploring deep features in this setting, where input feature importance approaches for deep feature understanding do not apply. Our experiments focus on understanding which hand-crafted and deep features are useful for the classification task of interest, how robust these features are for related tasks and how similar the deep features are to the meaningful hand-crafted features. Our proposed method is general to many application areas and we demonstrate its utility on orchestral music audio data.

MLDec 14, 2020

At the Intersection of Deep Sequential Model Framework and State-space Model Framework: Study on Option Pricing

Ziyang Ding, Sayan Mukherjee

Inference and forecast problems of the nonlinear dynamical system have arisen in a variety of contexts. Reservoir computing and deep sequential models, on the one hand, have demonstrated efficient, robust, and superior performance in modeling simple and chaotic dynamical systems. However, their innate deterministic feature has partially detracted their robustness to noisy system, and their inability to offer uncertainty measurement has also been an insufficiency of the framework. On the other hand, the traditional state-space model framework is robust to noise. It also carries measured uncertainty, forming a just-right complement to the reservoir computing and deep sequential model framework. We propose the unscented reservoir smoother, a model that unifies both deep sequential and state-space models to achieve both frameworks' superiorities. Evaluated in the option pricing setting on top of noisy datasets, URS strikes highly competitive forecasting accuracy, especially those of longer-term, and uncertainty measurement. Further extensions and implications on URS are also discussed to generalize a full integration of both frameworks.

MLJun 11, 2020

Stanza: A Nonlinear State Space Model for Probabilistic Inference in Non-Stationary Time Series

Anna K. Yanchenko, Sayan Mukherjee

Time series with long-term structure arise in a variety of contexts and capturing this temporal structure is a critical challenge in time series analysis for both inference and forecasting settings. Traditionally, state space models have been successful in providing uncertainty estimates of trajectories in the latent space. More recently, deep learning, attention-based approaches have achieved state of the art performance for sequence modeling, though often require large amounts of data and parameters to do so. We propose Stanza, a nonlinear, non-stationary state space model as an intermediate approach to fill the gap between traditional models and modern deep learning approaches for complex time series. Stanza strikes a balance between competitive forecasting accuracy and probabilistic, interpretable inference for highly structured time series. In particular, Stanza achieves forecasting accuracy competitive with deep LSTMs on real-world datasets, especially for multi-step ahead forecasting.

SPMar 5, 2020

Beyond Application End-Point Results: Quantifying Statistical Robustness of MCMC Accelerators

Xiangyu Zhang, Ramin Bashizade, Yicheng Wang et al.

Statistical machine learning often uses probabilistic algorithms, such as Markov Chain Monte Carlo (MCMC), to solve a wide range of problems. Probabilistic computations, often considered too slow on conventional processors, can be accelerated with specialized hardware by exploiting parallelism and optimizing the design using various approximation techniques. Current methodologies for evaluating correctness of probabilistic accelerators are often incomplete, mostly focusing only on end-point result quality ("accuracy"). It is important for hardware designers and domain experts to look beyond end-point "accuracy" and be aware of the hardware optimizations impact on other statistical properties. This work takes a first step towards defining metrics and a methodology for quantitatively evaluating correctness of probabilistic accelerators beyond end-point result quality. We propose three pillars of statistical robustness: 1) sampling quality, 2) convergence diagnostic, and 3) goodness of fit. We apply our framework to a representative MCMC accelerator and surface design issues that cannot be exposed using only application end-point result quality. Applying the framework to guide design space exploration shows that statistical robustness comparable to floating-point software can be achieved by slightly increasing the bit representation, without floating-point hardware requirements.

SPOct 27, 2019

A Case for Quantifying Statistical Robustness of Specialized Probabilistic AI Accelerators

Xiangyu Zhang, Sayan Mukherjee, Alvin R. Lebeck

Statistical machine learning often uses probabilistic algorithms, such as Markov Chain Monte Carlo (MCMC), to solve a wide range of problems. Many accelerators are proposed using specialized hardware to address sampling inefficiency, the critical performance bottleneck of probabilistic algorithms. These accelerators usually improve the hardware efficiency by using some approximation techniques, such as reducing bit representation, truncating small values to zero, or simplifying the Random Number Generator (RNG). Understanding the influence of these approximations on result quality is crucial to meeting the quality requirements of real applications. Although a common approach is to compare the end-point result quality using community-standard benchmarks and metrics, we claim a probabilistic architecture should provide some measure (or guarantee) of statistical robustness. This work takes a first step towards quantifying the statistical robustness of specialized hardware MCMC accelerators by proposing three pillars of statistical robustness: sampling quality, convergence diagnostic, and goodness of fit. Each pillar has at least one quantitative metric without the need to know the ground truth data. We apply this method to analyze the statistical robustness of an MCMC accelerator proposed by previous work, with some modifications, as a case study. The method also applies to other probabilistic accelerators and can be used in design space exploration.

COJul 2, 2019

Adaptive particle-based approximations of the Gibbs posterior for inverse problems

Zilong Zou, Sayan Mukherjee, Harbir Antil et al.

In this work, we adopt a general framework based on the Gibbs posterior to update belief distributions for inverse problems governed by partial differential equations (PDEs). The Gibbs posterior formulation is a generalization of standard Bayesian inference that only relies on a loss function connecting the unknown parameters to the data. It is particularly useful when the true data generating mechanism (or noise distribution) is unknown or difficult to specify. The Gibbs posterior coincides with Bayesian updating when a true likelihood function is known and the loss function corresponds to the negative log-likelihood, yet provides subjective inference in more general settings. We employ a sequential Monte Carlo (SMC) approach to approximate the Gibbs posterior using particles. To manage the computational cost of propagating increasing numbers of particles through the loss function, we employ a recently developed local reduced basis method to build an efficient surrogate loss function that is used in the Gibbs update formula in place of the true loss. We derive error bounds for our approximation and propose an adaptive approach to construct the surrogate model in an efficient manner. We demonstrate the efficiency of our approach through several numerical examples.

MLNov 15, 2018

Subspace Clustering through Sub-Clusters

Weiwei Li, Jan Hannig, Sayan Mukherjee

The problem of dimension reduction is of increasing importance in modern data analysis. In this paper, we consider modeling the collection of points in a high dimensional space as a union of low dimensional subspaces. In particular we propose a highly scalable sampling based algorithm that clusters the entire data via first spectral clustering of a small random sample followed by classifying or labeling the remaining out of sample points. The key idea is that this random subset borrows information across the entire data set and that the problem of clustering points can be replaced with the more efficient and robust problem of "clustering sub-clusters". We provide theoretical guarantees for our procedure. The numerical results indicate we outperform other state-of-the-art subspace clustering algorithms with respect to accuracy and speed.

MLMar 12, 2018

Scalable Algorithms for Learning High-Dimensional Linear Mixed Models

Zilong Tan, Kimberly Roche, Xiang Zhou et al.

Linear mixed models (LMMs) are used extensively to model dependecies of observations in linear regression and are used extensively in many application areas. Parameter estimation for LMMs can be computationally prohibitive on big data. State-of-the-art learning algorithms require computational complexity which depends at least linearly on the dimension $p$ of the covariates, and often use heuristics that do not offer theoretical guarantees. We present scalable algorithms for learning high-dimensional LMMs with sublinear computational complexity dependence on $p$. Key to our approach are novel dual estimators which use only kernel functions of the data, and fast computational techniques based on the subsampled randomized Hadamard transform. We provide theoretical guarantees for our learning algorithms, demonstrating the robustness of parameter estimation. Finally, we complement the theory with experiments on large synthetic and real data.

MLFeb 21, 2018

Learning Integral Representations of Gaussian Processes

Zilong Tan, Sayan Mukherjee

We propose a representation of Gaussian processes (GPs) based on powers of the integral operator defined by a kernel function, we call these stochastic processes integral Gaussian processes (IGPs). Sample paths from IGPs are functions contained within the reproducing kernel Hilbert space (RKHS) defined by the kernel function, in contrast sample paths from the standard GP are not functions within the RKHS. We develop computationally efficient non-parametric regression models based on IGPs. The main innovation in our regression algorithm is the construction of a low dimensional subspace that captures the information most relevant to explaining variation in the response. We use ideas from supervised dimension reduction to compute this subspace. The result of using the construction we propose involves significant improvements in the computational complexity of estimating kernel hyper-parameters as well as reducing the prediction variance.

SDAug 12, 2017

Classical Music Composition Using State Space Models

Anna K. Yanchenko, Sayan Mukherjee

Algorithmic composition of music has a long history and with the development of powerful deep learning methods, there has recently been increased interest in exploring algorithms and models to create art. We explore the utility of state space models, in particular hidden Markov models (HMMs) and variants, in composing classical piano pieces from the Romantic era and consider the models' ability to generate new pieces that sound like they were composed by a human. We find that the models we explored are fairly successful at generating new pieces that have largely consonant harmonies, especially when trained on original pieces with simple harmonic structure. However, we conclude that the major limitation in using these models to generate music that sounds like it was composed by a human is the lack of melodic progression in the composed pieces. We also examine the performance of the models in the context of music theory.

LGFeb 25, 2017

Efficient Learning of Mixed Membership Models

Zilong Tan, Sayan Mukherjee

We present an efficient algorithm for learning mixed membership models when the number of variables $p$ is much larger than the number of hidden components $k$. This algorithm reduces the computational complexity of state-of-the-art tensor methods, which require decomposing an $O\left(p^3\right)$ tensor, to factorizing $O\left(p/k\right)$ sub-tensors each of size $O\left(k^3\right)$. In addition, we address the issue of negative entries in the empirical method of moments based estimators. We provide sufficient conditions under which our approach has provable guarantees. Our approach obtains competitive empirical results on both simulated and real data.

STMar 17, 2016

Fast moment estimation for generalized latent Dirichlet models

Shiwen Zhao, Barbara E. Engelhardt, Sayan Mukherjee et al.

We develop a generalized method of moments (GMM) approach for fast parameter estimation in a new class of Dirichlet latent variable models with mixed data types. Parameter estimation via GMM has been demonstrated to have computational and statistical advantages over alternative methods, such as expectation maximization, variational inference, and Markov chain Monte Carlo. The key computational advan- tage of our method (MELD) is that parameter estimation does not require instantiation of the latent variables. Moreover, a representational advantage of the GMM approach is that the behavior of the model is agnostic to distributional assumptions of the observations. We derive population moment conditions after marginalizing out the sample-specific Dirichlet latent variables. The moment conditions only depend on component mean parameters. We illustrate the utility of our approach on simulated data, comparing results from MELD to alternative methods, and we show the promise of our approach through the application of MELD to several data sets.

MEAug 5, 2015

Bayesian Approximate Kernel Regression with Variable Selection

Lorin Crawford, Kris C. Wood, Xiang Zhou et al.

Nonlinear kernel regression models are often used in statistics and machine learning because they are more accurate than linear models. Variable selection for kernel regression models is a challenge partly because, unlike the linear regression setting, there is no clear concept of an effect size for regression coefficients. In this paper, we propose a novel framework that provides an effect size analog of each explanatory variable for Bayesian kernel regression models when the kernel is shift-invariant --- for example, the Gaussian kernel. We use function analytic properties of shift-invariant reproducing kernel Hilbert spaces (RKHS) to define a linear vector space that: (i) captures nonlinear structure, and (ii) can be projected onto the original explanatory variables. The projection onto the original explanatory variables serves as an analog of effect sizes. The specific function analytic property we use is that shift-invariant kernel functions can be approximated via random Fourier bases. Based on the random Fourier expansion we propose a computationally efficient class of Bayesian approximate kernel regression (BAKR) models for both nonlinear regression and binary classification for which one can compute an analog of effect sizes. We illustrate the utility of BAKR by examining two important problems in statistical genetics: genomic selection (i.e. phenotypic prediction) and association mapping (i.e. inference of significant variants or loci). State-of-the-art methods for genomic selection and association mapping are based on kernel regression and linear models, respectively. BAKR is the first method that is competitive in both settings.

MLApr 13, 2015

Adaptive Randomized Dimension Reduction on Massive Data

Gregory Darnell, Stoyan Georgiev, Sayan Mukherjee et al.

The scalability of statistical estimators is of increasing importance in modern applications. One approach to implementing scalable algorithms is to compress data into a low dimensional latent space using dimension reduction methods. In this paper we develop an approach for dimension reduction that exploits the assumption of low rank structure in high dimensional data to gain both computational and statistical advantages. We adapt recent randomized low-rank approximation algorithms to provide an efficient solution to principal component analysis (PCA), and we use this efficient solver to improve parameter estimation in large-scale linear mixed models (LMM) for association mapping in statistical and quantitative genomics. A key observation in this paper is that randomization serves a dual role, improving both computational and statistical performance by implicitly regularizing the covariance matrix estimate of the random effect in a LMM. These statistical and computational advantages are highlighted in our experiments on simulated data and large-scale genomic studies.

MENov 11, 2014

Bayesian group latent factor analysis with structured sparsity

Shiwen Zhao, Chuan Gao, Sayan Mukherjee et al.

Latent factor models are the canonical statistical tool for exploratory analyses of low-dimensional linear structure for an observation matrix with p features across n samples. We develop a structured Bayesian group factor analysis model that extends the factor model to multiple coupled observation matrices; in the case of two observations, this reduces to a Bayesian model of canonical correlation analysis. The main contribution of this work is to carefully define a structured Bayesian prior that encourages both element-wise and column-wise shrinkage and leads to desirable behavior on high-dimensional data. In particular, our model puts a structured prior on the joint factor loading matrix, regularizing at three levels, which enables element-wise sparsity and unsupervised recovery of latent factors corresponding to structured variance across arbitrary subsets of the observations. In addition, our structured prior allows for both dense and sparse latent factors so that covariation among either all features or only a subset of features can both be recovered. We use fast parameter-expanded expectation-maximization for parameter estimation in this model. We validate our method on both simulated data with substantial structure and real data, comparing against a number of state-of-the-art approaches. These results illustrate useful properties of our model, including i) recovering sparse signal in the presence of dense effects; ii) the ability to scale naturally to large numbers of observations; iii) flexible observation- and factor-specific regularization to recover factors with a wide variety of sparsity levels and percentage of variance explained; and iv) tractable inference that scales to modern genomic and document data sizes.

MLOct 29, 2013

The Information Geometry of Mirror Descent

Garvesh Raskutti, Sayan Mukherjee

Information geometry applies concepts in differential geometry to probability and statistics and is especially useful for parameter estimation in exponential families where parameters are known to lie on a Riemannian manifold. Connections between the geometric properties of the induced manifold and statistical properties of the estimation problem are well-established. However developing first-order methods that scale to larger problems has been less of a focus in the information geometry community. The best known algorithm that incorporates manifold structure is the second-order natural gradient descent algorithm introduced by Amari. On the other hand, stochastic approximation methods have led to the development of first-order methods for optimizing noisy objective functions. A recent generalization of the Robbins-Monro algorithm known as mirror descent, developed by Nemirovski and Yudin is a first order method that induces non-Euclidean geometries. However current analysis of mirror descent does not precisely characterize the induced non-Euclidean geometry nor does it consider performance in terms of statistical relative efficiency. In this paper, we prove that mirror descent induced by Bregman divergences is equivalent to the natural gradient descent algorithm on the dual Riemannian manifold. Using this equivalence, it follows that (1) mirror descent is the steepest descent direction along the Riemannian manifold of the exponential family; (2) mirror descent with log-likelihood loss applied to parameter estimation in exponential families asymptotically achieves the classical Cramér-Rao lower bound and (3) natural gradient descent for manifolds corresponding to exponential families can be implemented as a first-order method through mirror descent.