Yuansi Chen

ML
h-index2
15papers
1,050citations
Novelty59%
AI Score50

15 Papers

MLSep 19, 2023
Prominent Roles of Conditionally Invariant Components in Domain Adaptation: Theory and Algorithms

Keru Wu, Yuansi Chen, Wooseok Ha et al.

Domain adaptation (DA) is a statistical learning problem that arises when the distribution of the source data used to train a model differs from that of the target data used to evaluate the model. While many DA algorithms have demonstrated considerable empirical success, blindly applying these algorithms can often lead to worse performance on new datasets. To address this, it is crucial to clarify the assumptions under which a DA algorithm has good target performance. In this work, we focus on the assumption of the presence of conditionally invariant components (CICs), which are relevant for prediction and remain conditionally invariant across the source and target data. We demonstrate that CICs, which can be estimated through conditional invariant penalty (CIP), play three prominent roles in providing target risk guarantees in DA. First, we propose a new algorithm based on CICs, importance-weighted conditional invariant penalty (IW-CIP), which has target risk guarantees beyond simple settings such as covariate shift and label shift. Second, we show that CICs help identify large discrepancies between source and target risks of other DA algorithms. Finally, we demonstrate that incorporating CICs into the domain invariant projection (DIP) algorithm can address its failure scenario caused by label-flipping features. We support our new algorithms and theoretical findings via numerical experiments on synthetic data, MNIST, CelebA, Camelyon17, and DomainNet datasets.

MLApr 8, 2023
A Simple Proof of the Mixing of Metropolis-Adjusted Langevin Algorithm under Smoothness and Isoperimetry

Yuansi Chen, Khashayar Gatmiry

We study the mixing time of Metropolis-Adjusted Langevin algorithm (MALA) for sampling a target density on $\mathbb{R}^d$. We assume that the target density satisfies $ψ_μ$-isoperimetry and that the operator norm and trace of its Hessian are bounded by $L$ and $Υ$ respectively. Our main result establishes that, from a warm start, to achieve $ε$-total variation distance to the target density, MALA mixes in $O\left(\frac{(LΥ)^{\frac12}}{ψ_μ^2} \log\left(\frac{1}ε\right)\right)$ iterations. Notably, this result holds beyond the log-concave sampling setting and the mixing time depends on only $Υ$ rather than its upper bound $L d$. In the $m$-strongly logconcave and $L$-log-smooth sampling setting, our bound recovers the previous minimax mixing bound of MALA~\cite{wu2021minimax}.

CODec 9, 2024Code
PolytopeWalk: Sparse MCMC Sampling over Polytopes

Benny Sun, Yuansi Chen

High dimensional sampling is an important computational tool in statistics and other computational disciplines, with applications ranging from Bayesian statistical uncertainty quantification, metabolic modeling in systems biology to volume computation. We present $\textsf{PolytopeWalk}$, a new scalable Python library designed for uniform sampling over polytopes. The library provides an end-to-end solution, which includes preprocessing algorithms such as facial reduction and initialization methods. Six state-of-the-art MCMC algorithms on polytopes are implemented, including the Dikin, Vaidya, and John Walk. Additionally, we introduce novel sparse constrained formulations of these algorithms, enabling efficient sampling from sparse polytopes of the form $K_2 = \{x \in \mathbb{R}^d \ | \ Ax = b, x \succeq_k 0\}$. This implementation maintains sparsity in $A$, ensuring scalability to high dimensional settings $(d > 10^5)$. We demonstrate the improved sampling efficiency and per-iteration cost on both Netlib datasets and structured polytopes. $\textsf{PolytopeWalk}$ is available at github.com/ethz-randomwalk/polytopewalk with documentation at polytopewalk.readthedocs.io .

39.8PRMay 1
Talagrand's convolution conjecture up to loglog via perturbed reverse heat

Yuansi Chen

We prove that under the heat semigroup $(P_τ)$ on the Boolean hypercube, any nonnegative function exhibits a uniform tail bound that is better than Markov's inequality. Specifically, for any $τ> 0$, $n \geq 1$, $η> e^3$, and $f: \{-1,1\}^n \to \mathbb{R}_+$ with $\int f dμ> 0$, we have \begin{align*} \mathbb{P}_{X \sim μ}\left( P_τf(X) > η\int f dμ\right) \leq c_τ\frac{ (\log \log η)^{\frac32} }{η\sqrt{\log η}}, \end{align*} where $μ$ is the uniform measure on the Boolean hypercube $\{-1,1\}^n$ and $c_τ$ is a constant that depends only on $τ$. This result resolves Talagrand's convolution conjecture up to a dimension-free $(\log \log η)^{\frac32}$ factor. Our proof uses the reverse heat process on the Boolean hypercube, a coupling construction with carefully engineered perturbations of jump rates and a time-smoothed anti-concentration estimate.

CVMay 26, 2014Code
Fast and Robust Archetypal Analysis for Representation Learning

Yuansi Chen, Julien Mairal, Zaid Harchaoui

We revisit a pioneer unsupervised learning technique called archetypal analysis, which is related to successful data analysis methods such as sparse coding and non-negative matrix factorization. Since it was proposed, archetypal analysis did not gain a lot of popularity even though it produces more interpretable models than other alternatives. Because no efficient implementation has ever been made publicly available, its application to important scientific problems may have been severely limited. Our goal is to bring back into favour archetypal analysis. We propose a fast optimization scheme using an active-set strategy, and provide an efficient open-source implementation interfaced with Matlab, R, and Python. Then, we demonstrate the usefulness of archetypal analysis for computer vision tasks, such as codebook learning, signal classification, and large image collection visualization.

DSDec 15, 2024
Regularized Dikin Walks for Sampling Truncated Logconcave Measures, Mixed Isoperimetry and Beyond Worst-Case Analysis

Minhui Jiang, Yuansi Chen

We study the problem of drawing samples from a logconcave distribution truncated on a polytope, motivated by computational challenges in Bayesian statistical models with indicator variables, such as probit regression. Building on interior point methods and the Dikin walk for sampling from uniform distributions, we analyze the mixing time of regularized Dikin walks. Our contributions are threefold. First, for a logconcave and log-smooth distribution with condition number $κ$, truncated on a polytope in $\mathbb{R}^n$ defined with $m$ linear constraints, we prove that the soft-threshold Dikin walk mixes in $\widetilde{O}((m+κ)n)$ iterations from a warm initialization. It improves upon prior work which required the polytope to be bounded and involved a bound dependent on the radius of the bounded region. Moreover, we introduce the regularized Dikin walk using Lewis weights for approximating the John ellipsoid. We show that it mixes in $\widetilde{O}((n^{2.5}+κn)$. Second, we extend the mixing time guarantees mentioned above to weakly log-concave distributions truncated on polytopes, provided that they have a finite covariance matrix. Third, going beyond worst-case mixing time analysis, we demonstrate that soft-threshold Dikin walk can mix significantly faster when only a limited number of constraints intersect the high-probability mass of the distribution, improving the $\widetilde{O}((m+κ)n)$ upper bound to $\widetilde{O}(m + κn)$. Additionally, per-iteration complexity of regularized Dikin walk and ways to generate a warm initialization are discussed to facilitate practical implementation.

MLJul 19, 2025
When few labeled target data suffice: a theory of semi-supervised domain adaptation via fine-tuning from multiple adaptive starts

Wooseok Ha, Yuansi Chen

Semi-supervised domain adaptation (SSDA) aims to achieve high predictive performance in the target domain with limited labeled target data by exploiting abundant source and unlabeled target data. Despite its significance in numerous applications, theory on the effectiveness of SSDA remains largely unexplored, particularly in scenarios involving various types of source-target distributional shifts. In this work, we develop a theoretical framework based on structural causal models (SCMs) which allows us to analyze and quantify the performance of SSDA methods when labeled target data is limited. Within this framework, we introduce three SSDA methods, each having a fine-tuning strategy tailored to a distinct assumption about the source and target relationship. Under each assumption, we demonstrate how extending an unsupervised domain adaptation (UDA) method to SSDA can achieve minimax-optimal target performance with limited target labels. When the relationship between source and target data is only vaguely known -- a common practical concern -- we propose the Multi Adaptive-Start Fine-Tuning (MASFT) algorithm, which fine-tunes UDA models from multiple starting points and selects the best-performing one based on a small hold-out target validation dataset. Combined with model selection guarantees, MASFT achieves near-optimal target predictive performance across a broad range of types of distributional shifts while significantly reducing the need for labeled target data. We empirically validate the effectiveness of our proposed methods through simulations.

MLSep 27, 2021
Minimax Mixing Time of the Metropolis-Adjusted Langevin Algorithm for Log-Concave Sampling

Keru Wu, Scott Schmidler, Yuansi Chen

We study the mixing time of the Metropolis-adjusted Langevin algorithm (MALA) for sampling from a log-smooth and strongly log-concave distribution. We establish its optimal minimax mixing time under a warm start. Our main contribution is two-fold. First, for a $d$-dimensional log-concave density with condition number $κ$, we show that MALA with a warm start mixes in $\tilde O(κ\sqrt{d})$ iterations up to logarithmic factors. This improves upon the previous work on the dependency of either the condition number $κ$ or the dimension $d$. Our proof relies on comparing the leapfrog integrator with the continuous Hamiltonian dynamics, where we establish a new concentration bound for the acceptance rate. Second, we prove a spectral gap based mixing time lower bound for reversible MCMC algorithms on general state spaces. We apply this lower bound result to construct a hard distribution for which MALA requires at least $\tilde Ω(κ\sqrt{d})$ steps to mix. The lower bound for MALA matches our upper bound in terms of condition number and dimension. Finally, numerical experiments are included to validate our theoretical results.

MLOct 29, 2020
Domain adaptation under structural causal models

Yuansi Chen, Peter Bühlmann

Domain adaptation (DA) arises as an important problem in statistical machine learning when the source data used to train a model is different from the target data used to test the model. Recent advances in DA have mainly been application-driven and have largely relied on the idea of a common subspace for source and target data. To understand the empirical successes and failures of DA methods, we propose a theoretical framework via structural causal models that enables analysis and comparison of the prediction performance of DA methods. This framework also allows us to itemize the assumptions needed for the DA methods to have a low target error. Additionally, with insights from our theory, we propose a new DA method called CIRM that outperforms existing DA methods when both the covariates and label distributions are perturbed in the target data. We complement the theoretical analysis with extensive simulations to show the necessity of the devised assumptions. Reproducible synthetic and real data experiments are also provided to illustrate the strengths and weaknesses of DA methods when parts of the assumptions in our theory are violated.

MLMay 29, 2019
Fast mixing of Metropolized Hamiltonian Monte Carlo: Benefits of multi-step gradients

Yuansi Chen, Raaz Dwivedi, Martin J. Wainwright et al.

Hamiltonian Monte Carlo (HMC) is a state-of-the-art Markov chain Monte Carlo sampling algorithm for drawing samples from smooth probability densities over continuous spaces. We study the variant most widely used in practice, Metropolized HMC with the Störmer-Verlet or leapfrog integrator, and make two primary contributions. First, we provide a non-asymptotic upper bound on the mixing time of the Metropolized HMC with explicit choices of step-size and number of leapfrog steps. This bound gives a precise quantification of the faster convergence of Metropolized HMC relative to simpler MCMC algorithms such as the Metropolized random walk, or Metropolized Langevin algorithm. Second, we provide a general framework for sharpening mixing time bounds of Markov chains initialized at a substantial distance from the target distribution over continuous spaces. We apply this sharpening device to the Metropolized random walk and Langevin algorithms, thereby obtaining improved mixing time bounds from a non-warm initial distribution.

MLNov 20, 2018
Sampling Can Be Faster Than Optimization

Yi-An Ma, Yuansi Chen, Chi Jin et al.

Optimization algorithms and Monte Carlo sampling algorithms have provided the computational foundations for the rapid growth in applications of statistical machine learning in recent years. There is, however, limited theoretical understanding of the relationships between these two kinds of methodology, and limited understanding of relative strengths and weaknesses. Moreover, existing results have been obtained primarily in the setting of convex functions (for optimization) and log-concave functions (for sampling). In this setting, where local properties determine global properties, optimization algorithms are unsurprisingly more efficient computationally than sampling algorithms. We instead examine a class of nonconvex objective functions that arise in mixture modeling and multi-stable systems. In this nonconvex setting, we find that the computational complexity of sampling algorithms scales linearly with the model dimension while that of optimization algorithms scales exponentially.

MLApr 4, 2018
Stability and Convergence Trade-off of Iterative Optimization Algorithms

Yuansi Chen, Chi Jin, Bin Yu

The overall performance or expected excess risk of an iterative machine learning algorithm can be decomposed into training error and generalization error. While the former is controlled by its convergence analysis, the latter can be tightly handled by algorithmic stability. The machine learning community has a rich history investigating convergence and stability separately. However, the question about the trade-off between these two quantities remains open. In this paper, we show that for any iterative algorithm at any iteration, the overall performance is lower bounded by the minimax statistical error over an appropriately chosen loss function class. This implies an important trade-off between convergence and stability of the algorithm -- a faster converging algorithm has to be less stable, and vice versa. As a direct consequence of this fundamental tradeoff, new convergence lower bounds can be derived for classes of algorithms constrained with different stability bounds. In particular, when the loss function is convex (or strongly convex) and smooth, we discuss the stability upper bounds of gradient descent (GD) and stochastic gradient descent and their variants with decreasing step sizes. For Nesterov's accelerated gradient descent (NAG) and heavy ball method (HB), we provide stability upper bounds for the quadratic loss function. Applying existing stability upper bounds for the gradient methods in our trade-off framework, we obtain lower bounds matching the well-established convergence upper bounds up to constants for these algorithms and conjecture similar lower bounds for NAG and HB. Finally, we numerically demonstrate the tightness of our stability bounds in terms of exponents in the rate and also illustrate via a simulated logistic regression problem that our stability bounds reflect the generalization errors better than the simple uniform convergence bounds for GD and NAG.

MLJan 8, 2018
Log-concave sampling: Metropolis-Hastings algorithms are fast

Raaz Dwivedi, Yuansi Chen, Martin J. Wainwright et al.

We consider the problem of sampling from a strongly log-concave density in $\mathbb{R}^d$, and prove a non-asymptotic upper bound on the mixing time of the Metropolis-adjusted Langevin algorithm (MALA). The method draws samples by simulating a Markov chain obtained from the discretization of an appropriate Langevin diffusion, combined with an accept-reject step. Relative to known guarantees for the unadjusted Langevin algorithm (ULA), our bounds show that the use of an accept-reject step in MALA leads to an exponentially improved dependence on the error-tolerance. Concretely, in order to obtain samples with TV error at most $δ$ for a density with condition number $κ$, we show that MALA requires $\mathcal{O} \big(κd \log(1/δ) \big)$ steps, as compared to the $\mathcal{O} \big(κ^2 d/δ^2 \big)$ steps established in past work on ULA. We also demonstrate the gains of MALA over ULA for weakly log-concave densities. Furthermore, we derive mixing time bounds for the Metropolized random walk (MRW) and obtain $\mathcal{O}(κ)$ mixing time slower than MALA. We provide numerical examples that support our theoretical findings, and demonstrate the benefits of Metropolis-Hastings adjustment for Langevin-type sampling algorithms.

MLOct 23, 2017
Fast MCMC sampling algorithms on polytopes

Yuansi Chen, Raaz Dwivedi, Martin J. Wainwright et al.

We propose and analyze two new MCMC sampling algorithms, the Vaidya walk and the John walk, for generating samples from the uniform distribution over a polytope. Both random walks are sampling algorithms derived from interior point methods. The former is based on volumetric-logarithmic barrier introduced by Vaidya whereas the latter uses John's ellipsoids. We show that the Vaidya walk mixes in significantly fewer steps than the logarithmic-barrier based Dikin walk studied in past work. For a polytope in $\mathbb{R}^d$ defined by $n >d$ linear constraints, we show that the mixing time from a warm start is bounded as $\mathcal{O}(n^{0.5}d^{1.5})$, compared to the $\mathcal{O}(nd)$ mixing time bound for the Dikin walk. The cost of each step of the Vaidya walk is of the same order as the Dikin walk, and at most twice as large in terms of constant pre-factors. For the John walk, we prove an $\mathcal{O}(d^{2.5}\cdot\log^4(n/d))$ bound on its mixing time and conjecture that an improved variant of it could achieve a mixing time of $\mathcal{O}(d^2\cdot\text{polylog}(n/d))$. Additionally, we propose variants of the Vaidya and John walks that mix in polynomial time from a deterministic starting point. The speed-up of the Vaidya walk over the Dikin walk are illustrated in numerical examples.

LGDec 11, 2016
Self-calibrating Neural Networks for Dimensionality Reduction

Yuansi Chen, Cengiz Pehlevan, Dmitri B. Chklovskii

Recently, a novel family of biologically plausible online algorithms for reducing the dimensionality of streaming data has been derived from the similarity matching principle. In these algorithms, the number of output dimensions can be determined adaptively by thresholding the singular values of the input data matrix. However, setting such threshold requires knowing the magnitude of the desired singular values in advance. Here we propose online algorithms where the threshold is self-calibrating based on the singular values computed from the existing observations. To derive these algorithms from the similarity matching cost function we propose novel regularizers. As before, these online algorithms can be implemented by Hebbian/anti-Hebbian neural networks in which the learning rule depends on the chosen regularizer. We demonstrate both mathematically and via simulation the effectiveness of these online algorithms in various settings.