David B Dunson

ML
6 papers
96 citations
Novelty 58%
AI Score 26

6 Papers

ML · Oct 14, 2021
Inferring manifolds using Gaussian processes

David B Dunson, Nan Wu

It is often of interest to infer lower-dimensional structure underlying complex data. It is common to focus on Riemannian manifolds, a flexible class of non-linear structures. Most existing manifold learning algorithms replace the original data with lower-dimensional coordinates without providing an estimate of the manifold or using the manifold to denoise the original data. This article proposes a new methodology to address these problems, allowing interpolation of the estimated manifold between the fitted data points. The proposed approach is motivated by novel theoretical properties of local covariance matrices constructed from samples near a manifold. Our results enable us to turn a global manifold reconstruction problem into a local regression problem, allowing for the application of Gaussian processes for probabilistic manifold reconstruction. In addition to the theory justifying our methodology, we provide simulated and real data examples to illustrate the performance.
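The role of local covariance matrices can be illustrated with a minimal sketch. This is not the authors' estimator: it is a plain local PCA, assuming noise-free points on a unit circle, and the function name `local_tangent` and its parameters `k` and `d` are hypothetical. The leading eigenvectors of the covariance of nearby points approximate the tangent space at the query point.

```python
import numpy as np

def local_tangent(points, center, k, d):
    """Estimate a d-dimensional tangent basis at `center` via local PCA.

    Illustrative only: a generic local-covariance computation, not the
    estimator analyzed in the paper.
    """
    dists = np.linalg.norm(points - center, axis=1)
    nbrs = points[np.argsort(dists)[:k]]       # k nearest neighbors
    diffs = nbrs - center                      # center at the query point
    local_cov = diffs.T @ diffs / k            # local covariance matrix
    eigvals, eigvecs = np.linalg.eigh(local_cov)  # ascending order
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
    return eigvecs[:, :d], eigvals

# Assumed toy data: 400 evenly spaced points on the unit circle.
theta = np.linspace(-np.pi, np.pi, 400, endpoint=False)
circle = np.column_stack([np.cos(theta), np.sin(theta)])

basis, spectrum = local_tangent(circle, np.array([1.0, 0.0]), k=15, d=1)
```

At the point (1, 0) the circle's tangent is vertical, so the estimated basis vector should be close to (0, ±1), and the first eigenvalue should dominate the second.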

ME · Oct 14, 2020
Graph Based Gaussian Processes on Restricted Domains

David B Dunson, Hau-Tieng Wu, Nan Wu

In nonparametric regression, it is common for the inputs to fall in a restricted subset of Euclidean space. Typical kernel-based methods that do not take into account the intrinsic geometry of the domain across which observations are collected may produce sub-optimal results. In this article, we focus on solving this problem in the context of Gaussian process (GP) models, proposing a new class of Graph Laplacian based GPs (GL-GPs), which learn a covariance that respects the geometry of the input domain. As the heat kernel is intractable computationally, we approximate the covariance using finitely many eigenpairs of the Graph Laplacian (GL). The GL is constructed from a kernel which depends only on the Euclidean coordinates of the inputs. Hence, we can use full knowledge of the kernel to extend the covariance structure to newly arriving samples by a Nyström-type extension. We provide substantial theoretical support for the GL-GP methodology, and illustrate performance gains in various applications.
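The idea of approximating a heat-kernel covariance from finitely many graph Laplacian eigenpairs can be sketched as follows. This is a simplified illustration under stated assumptions, not the GL-GP construction itself: it uses an unnormalized Laplacian built from a Gaussian affinity, omits the Nyström-type extension, and the function name `gl_gp_covariance` and the parameters `bandwidth`, `n_eig`, and `t` are hypothetical.

```python
import numpy as np

def gl_gp_covariance(X, bandwidth, n_eig, t):
    """Truncated heat-kernel-style covariance from graph Laplacian eigenpairs.

    Assumptions (not from the paper): Gaussian affinity, unnormalized
    Laplacian L = D - W, covariance sum_i exp(-t * lam_i) v_i v_i^T.
    """
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / bandwidth ** 2)     # affinity from Euclidean coordinates
    L = np.diag(W.sum(axis=1)) - W             # unnormalized graph Laplacian
    lam, V = np.linalg.eigh(L)                 # eigenvalues in ascending order
    lam, V = lam[:n_eig], V[:, :n_eig]         # keep the n_eig smallest eigenpairs
    return (V * np.exp(-t * lam)) @ V.T

# Assumed toy inputs: 60 points restricted to the unit circle.
rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2.0 * np.pi, size=60)
X = np.column_stack([np.cos(theta), np.sin(theta)])

K = gl_gp_covariance(X, bandwidth=0.3, n_eig=10, t=0.5)
```

Because every retained eigenvalue receives a positive weight, the resulting matrix is symmetric and positive semi-definite, so it can serve as a GP covariance over the observed inputs.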

ST · Jun 29, 2019
Geodesic Distance Estimation with Spherelets

Didong Li, David B Dunson

Many statistical and machine learning approaches rely on pairwise distances between data points. The choice of distance metric has a fundamental impact on the performance of these procedures, raising questions about how to appropriately calculate distances. When data points are real-valued vectors, by far the most common choice is the Euclidean distance. This article focuses on the problem of how to better calculate distances by taking into account the intrinsic geometry of the data, assuming data are concentrated near an unknown subspace or manifold. The appropriate geometric distance corresponds to the length of the shortest path along the manifold, which is the geodesic distance. When the manifold is unknown, it is challenging to accurately approximate the geodesic distance. Current algorithms are either highly complex, and hence often impractical to implement, or based on simple local linear approximations and shortest path algorithms that may have inadequate accuracy. We propose a simple and general alternative, which uses pieces of spheres, or spherelets, to locally approximate the unknown subspace and thereby estimate the geodesic distance through paths over spheres. Theory is developed showing lower error for many manifolds, with applications in clustering, conditional density estimation and mean regression. The conclusion is supported through multiple simulation examples and real data sets.
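For context, the baseline the abstract contrasts against (local linear segments joined by a shortest-path search) can be sketched in a few lines. This is not the spherelet estimator: it is a generic k-nearest-neighbor graph with Floyd-Warshall shortest paths, and the function name `graph_geodesics` and the parameter `k` are hypothetical.

```python
import numpy as np

def graph_geodesics(X, k):
    """Approximate geodesic distances by shortest paths on a k-NN graph.

    Baseline sketch only (straight-line edges), not the spherelet method.
    """
    n = len(X)
    eucl = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    D = np.full((n, n), np.inf)
    np.fill_diagonal(D, 0.0)
    nbr_idx = np.argsort(eucl, axis=1)[:, 1:k + 1]   # skip self at position 0
    rows = np.repeat(np.arange(n), k)
    cols = nbr_idx.ravel()
    D[rows, cols] = eucl[rows, cols]
    D[cols, rows] = eucl[rows, cols]                 # make the graph undirected
    for m in range(n):                               # Floyd-Warshall relaxation
        D = np.minimum(D, D[:, m:m + 1] + D[m:m + 1, :])
    return D

# Assumed toy data: 200 evenly spaced points on the unit circle.
theta = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
circle = np.column_stack([np.cos(theta), np.sin(theta)])

D = graph_geodesics(circle, k=4)
est = D[0, 100]   # graph distance between antipodal points
```

For antipodal points the Euclidean distance is 2 while the true geodesic distance is pi. The graph estimate sums chord lengths, so it falls slightly below pi; on curved manifolds with sparser samples this underestimation grows, which is the kind of error that curved local approximations aim to reduce.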

ML · Mar 3, 2019
Classification via local manifold approximation

Didong Li, David B Dunson

Classifiers label data as belonging to one of a set of groups based on input features. It is challenging to obtain accurate classification performance when the feature distributions in the different classes are complex, with nonlinear, overlapping and intersecting supports. This is particularly true when training data are limited. To address this problem, this article proposes a new type of classifier based on obtaining a local approximation to the support of the data within each class in a neighborhood of the feature to be classified, and assigning the feature to the class having the closest support. This general algorithm is referred to as LOcal Manifold Approximation (LOMA) classification. As a simple and theoretically supported special case having excellent performance in a broad variety of examples, we use spheres for local approximation, obtaining a SPherical Approximation (SPA) classifier. We illustrate substantial gains for SPA over competitors on a variety of challenging simulated and real data examples.
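The classification rule described above (fit a local sphere to each class near the query, then pick the class whose sphere is closest) can be sketched as follows. This is a minimal illustration, not the authors' SPA implementation: it substitutes a standard algebraic least-squares sphere fit for whatever estimator the paper uses, and the names `fit_sphere`, `spa_predict`, and the parameter `k` are hypothetical.

```python
import numpy as np

def fit_sphere(P):
    """Algebraic least-squares sphere fit (an assumed stand-in estimator).

    Solves ||x||^2 = 2 c.x + b for center c, with b = r^2 - ||c||^2.
    """
    A = np.column_stack([2.0 * P, np.ones(len(P))])
    rhs = np.sum(P ** 2, axis=1)
    sol, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    center = sol[:-1]
    radius = np.sqrt(max(sol[-1] + center @ center, 0.0))
    return center, radius

def spa_predict(x, X, y, k):
    """Assign x to the class whose locally fitted sphere is closest."""
    best_label, best_dist = None, np.inf
    for label in np.unique(y):
        Xc = X[y == label]
        order = np.argsort(np.linalg.norm(Xc - x, axis=1))
        center, radius = fit_sphere(Xc[order[:k]])   # k nearest in this class
        dist = abs(np.linalg.norm(x - center) - radius)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

# Assumed toy data: two concentric circles with radii 1 (class 0) and 3 (class 1).
theta = np.linspace(0.0, 2.0 * np.pi, 100, endpoint=False)
ring = np.column_stack([np.cos(theta), np.sin(theta)])
X = np.vstack([1.0 * ring, 3.0 * ring])
y = np.repeat([0, 1], 100)

pred_inner = spa_predict(np.array([1.2, 0.0]), X, y, k=10)
pred_outer = spa_predict(np.array([2.8, 0.0]), X, y, k=10)
```

In this toy setting a query at radius 1.2 lies 0.2 from the inner circle and 1.8 from the outer one, so it is assigned to class 0; a query at radius 2.8 is assigned to class 1.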

ML · Oct 31, 2018
Targeted stochastic gradient Markov chain Monte Carlo for hidden Markov models with rare latent states

Rihui Ou, Deborshee Sen, Alexander L Young et al.

Markov chain Monte Carlo (MCMC) algorithms for hidden Markov models often rely on the forward-backward sampler. This makes them computationally slow as the length of the time series increases, motivating the development of sub-sampling-based approaches. These approximate the full posterior by using small random subsequences of the data at each MCMC iteration within stochastic gradient MCMC. In the presence of imbalanced data resulting from rare latent states, subsequences often exclude rare latent state data, leading to inaccurate inference and prediction/detection of rare events. We propose a targeted sub-sampling (TASS) approach that over-samples observations corresponding to rare latent states when calculating the stochastic gradient of parameters associated with them. TASS uses an initial clustering of the data to construct subsequence weights that reduce the variance in gradient estimation. This leads to improved sampling efficiency, in particular in settings where the rare latent states correspond to extreme observations. We demonstrate substantial gains in predictive and inferential accuracy on real and synthetic examples.

ML · Oct 19, 2018
Bayesian Distance Clustering

Leo L Duan, David B Dunson

Model-based clustering is widely used in a variety of application areas. However, fundamental concerns remain about robustness. In particular, results can be sensitive to the choice of kernel representing the within-cluster data density. Leveraging properties of pairwise differences between data points, we propose a class of Bayesian distance clustering methods, which rely on modeling the likelihood of the pairwise distances in place of the original data. Although some information in the data is discarded, we gain substantial robustness to modeling assumptions. The proposed approach represents an appealing middle ground between distance- and model-based clustering, drawing advantages from each of these canonical approaches. We illustrate dramatic gains in the ability to infer clusters that are not well represented by the usual choices of kernel. A simulation study is included to assess performance relative to competitors, and we apply the approach to clustering of brain genome expression data.

Keywords: Distance-based clustering; Mixture model; Model-based clustering; Model misspecification; Pairwise distance matrix; Partial likelihood; Robustness.