CVNov 17, 2022
How to Fine-Tune Vision Models with SGDAnanya Kumar, Ruoqi Shen, Sebastien Bubeck et al. · microsoft-research
SGD and AdamW are the two most used optimizers for fine-tuning large neural networks in computer vision. When the two methods perform the same, SGD is preferable because it uses less memory (12 bytes/parameter with momentum and 8 bytes/parameter without) than AdamW (16 bytes/parameter). However, on a suite of downstream tasks, especially those with distribution shifts, we find that fine-tuning with AdamW performs substantially better than SGD on modern Vision Transformer and ConvNeXt models. We find that large gaps in performance between SGD and AdamW occur when the fine-tuning gradients in the first "embedding" layer are much larger than in the rest of the model. Our analysis suggests an easy fix that works consistently across datasets and models: freezing the embedding layer (less than 1% of the parameters) leads to SGD with or without momentum performing slightly better than AdamW while using less memory (e.g., on ViT-L, SGD uses 33% less GPU memory). Our insights result in state-of-the-art accuracies on five popular distribution shift benchmarks: WILDS-FMoW, WILDS-Camelyon, BREEDS-Living-17, Waterbirds, and DomainNet.
LGMar 3, 2022
Data Augmentation as Feature ManipulationRuoqi Shen, Sébastien Bubeck, Suriya Gunasekar · microsoft-research
Data augmentation is a cornerstone of the machine learning pipeline, yet its theoretical underpinnings remain unclear. Is it merely a way to artificially augment the data set size? Or is it about encouraging the model to satisfy certain invariance? In this work we consider another angle, and we study the effect of data augmentation on the dynamic of the learning process. We find that data augmentation can alter the relative importance of various features, effectively making certain informative but hard to learn features more likely to be captured in the learning process. Importantly, we show that this effect is more pronounced for non-linear models, such as neural networks. Our main contribution is a detailed analysis of data augmentation on the learning dynamic for a two layer convolutional neural network in the recently proposed multi-view data model by Allen-Zhu and Li [2020]. We complement this analysis with further experimental evidence that data augmentation can be viewed as feature manipulation.
DSOct 13, 2022
Condition-number-independent convergence rate of Riemannian Hamiltonian Monte Carlo with numerical integratorsYunbum Kook, Yin Tat Lee, Ruoqi Shen et al.
We study the convergence rate of discretized Riemannian Hamiltonian Monte Carlo on sampling from distributions in the form of $e^{-f(x)}$ on a convex body $\mathcal{M}\subset\mathbb{R}^{n}$. We show that for distributions in the form of $e^{-α^{\top}x}$ on a polytope with $m$ constraints, the convergence rate of a family of commonly-used integrators is independent of $\left\Vert α\right\Vert _{2}$ and the geometry of the polytope. In particular, the implicit midpoint method (IMM) and the generalized Leapfrog method (LM) have a mixing time of $\widetilde{O}\left(mn^{3}\right)$ to achieve $ε$ total variation distance to the target distribution. These guarantees are based on a general bound on the convergence rate for densities of the form $e^{-f(x)}$ in terms of parameters of the manifold and the integrator. Our theoretical guarantee complements the empirical results of [KLSV22], which shows that RHMC with IMM can sample ill-conditioned, non-smooth and constrained distributions in very high dimension efficiently in practice.
CLNov 22, 2023
Positional Description Matters for Transformers ArithmeticRuoqi Shen, Sébastien Bubeck, Ronen Eldan et al.
Transformers, central to the successes in modern Natural Language Processing, often falter on arithmetic tasks despite their vast capabilities --which paradoxically include remarkable coding abilities. We observe that a crucial challenge is their naive reliance on positional information to solve arithmetic problems with a small number of digits, leading to poor performance on larger numbers. Herein, we delve deeper into the role of positional encoding, and propose several ways to fix the issue, either by modifying the positional encoding directly, or by modifying the representation of the arithmetic task to leverage standard positional encoding differently. We investigate the value of these modifications for three tasks: (i) classical multiplication, (ii) length extrapolation in addition, and (iii) addition in natural language context. For (i) we train a small model on a small dataset (100M parameters and 300k samples) with remarkable aptitude in (direct, no scratchpad) 15 digits multiplication and essentially perfect up to 12 digits, while usual training in this context would give a model failing at 4 digits multiplication. In the experiments on addition, we use a mere 120k samples to demonstrate: for (ii) extrapolation from 10 digits to testing on 12 digits numbers while usual training would have no extrapolation, and for (iii) almost perfect accuracy up to 5 digits while usual training would be correct only up to 3 digits (which is essentially memorization with a training set of 120k samples).
LGJul 18, 2022
Private Convex Optimization in General NormsSivakanth Gopi, Yin Tat Lee, Daogao Liu et al.
We propose a new framework for differentially private optimization of convex functions which are Lipschitz in an arbitrary norm $\|\cdot\|$. Our algorithms are based on a regularized exponential mechanism which samples from the density $\propto \exp(-k(F+μr))$ where $F$ is the empirical loss and $r$ is a regularizer which is strongly convex with respect to $\|\cdot\|$, generalizing a recent work of [Gopi, Lee, Liu '22] to non-Euclidean settings. We show that this mechanism satisfies Gaussian differential privacy and solves both DP-ERM (empirical risk minimization) and DP-SCO (stochastic convex optimization) by using localization tools from convex geometry. Our framework is the first to apply to private convex optimization in general normed spaces and directly recovers non-private SCO rates achieved by mirror descent as the privacy parameter $ε\to \infty$. As applications, for Lipschitz optimization in $\ell_p$ norms for all $p \in (1, 2)$, we obtain the first optimal privacy-utility tradeoffs; for $p = 1$, we improve tradeoffs obtained by the recent works [Asi, Feldman, Koren, Talwar '21, Bassily, Guzman, Nandi '21] by at least a logarithmic factor. Our $\ell_p$ norm and Schatten-$p$ norm optimization frameworks are complemented with polynomial-time samplers whose query complexity we explicitly bound.
CLOct 15, 2023
FiLM: Fill-in Language Models for Any-Order GenerationTianxiao Shen, Hao Peng, Ruoqi Shen et al.
Language models have become the backbone of today's AI systems. However, their predominant left-to-right generation limits the use of bidirectional context, which is essential for tasks that involve filling text in the middle. We propose the Fill-in Language Model (FiLM), a new language modeling approach that allows for flexible generation at any position without adhering to a specific generation order. Its training extends the masked language modeling objective by adopting varying mask probabilities sampled from the Beta distribution to enhance the generative capabilities of FiLM. During inference, FiLM can seamlessly insert missing phrases, sentences, or paragraphs, ensuring that the outputs are fluent and are coherent with the surrounding context. In both automatic and human evaluations, FiLM outperforms existing infilling methods that rely on left-to-right language models trained on rearranged text segments. FiLM is easy to implement and can be either trained from scratch or fine-tuned from a left-to-right language model. Notably, as the model size grows, FiLM's perplexity approaches that of strong left-to-right language models of similar sizes, indicating FiLM's scalability and potential as a large language model.
DSFeb 13, 2023
Algorithmic Aspects of the Log-Laplace Transform and a Non-Euclidean Proximal SamplerSivakanth Gopi, Yin Tat Lee, Daogao Liu et al.
The development of efficient sampling algorithms catering to non-Euclidean geometries has been a challenging endeavor, as discretization techniques which succeed in the Euclidean setting do not readily carry over to more general settings. We develop a non-Euclidean analog of the recent proximal sampler of [LST21], which naturally induces regularization by an object known as the log-Laplace transform (LLT) of a density. We prove new mathematical properties (with an algorithmic flavor) of the LLT, such as strong convexity-smoothness duality and an isoperimetric inequality, which are used to prove a mixing time on our proximal sampler matching [LST21] under a warm start. As our main application, we show our warm-started sampler improves the value oracle complexity of differentially private convex optimization in $\ell_p$ and Schatten-$p$ norms for $p \in [1, 2]$ to match the Euclidean setting [GLL22], while retaining state-of-the-art excess risk bounds [GLLST23]. We find our investigation of the LLT to be a promising proof-of-concept of its utility as a tool for designing samplers, and outline directions for future exploration.
LGFeb 20, 2022
On Optimal Early Stopping: Over-informative versus Under-informative ParametrizationRuoqi Shen, Liyao Gao, Yi-An Ma
Early stopping is a simple and widely used method to prevent over-training neural networks. We develop theoretical results to reveal the relationship between the optimal early stopping time and model dimension as well as sample size of the dataset for certain linear models. Our results demonstrate two very different behaviors when the model dimension exceeds the number of features versus the opposite scenario. While most previous works on linear models focus on the latter setting, we observe that the dimension of the model often exceeds the number of features arising from data in common deep learning tasks and propose a model to study this setting. We demonstrate experimentally that our theoretical results on optimal early stopping time corresponds to the training process of deep neural networks.
LGFeb 3, 2022
Sampling with Riemannian Hamiltonian Monte Carlo in a Constrained SpaceYunbum Kook, Yin Tat Lee, Ruoqi Shen et al.
We demonstrate for the first time that ill-conditioned, non-smooth, constrained distributions in very high dimension, upwards of 100,000, can be sampled efficiently $\textit{in practice}$. Our algorithm incorporates constraints into the Riemannian version of Hamiltonian Monte Carlo and maintains sparsity. This allows us to achieve a mixing rate independent of smoothness and condition numbers. On benchmark data sets in systems biology and linear programming, our algorithm outperforms existing packages by orders of magnitude. In particular, we achieve a 1,000-fold speed-up for sampling from the largest published human metabolic network (RECON3D). Our package has been incorporated into the COBRA toolbox.
STDec 23, 2021
Analysis of Langevin Monte Carlo from Poincaré to Log-SobolevSinho Chewi, Murat A. Erdogdu, Mufan Bill Li et al.
Classically, the continuous-time Langevin diffusion converges exponentially fast to its stationary distribution $π$ under the sole assumption that $π$ satisfies a Poincaré inequality. Using this fact to provide guarantees for the discrete-time Langevin Monte Carlo (LMC) algorithm, however, is considerably more challenging due to the need for working with chi-squared or Rényi divergences, and prior works have largely focused on strongly log-concave targets. In this work, we provide the first convergence guarantees for LMC assuming that $π$ satisfies either a Latała--Oleszkiewicz or modified log-Sobolev inequality, which interpolates between the Poincaré and log-Sobolev settings. Unlike prior works, our results allow for weak smoothness and do not require convexity or dissipativity conditions.
DSJun 10, 2021
Lower Bounds on Metropolized Sampling Methods for Well-Conditioned DistributionsYin Tat Lee, Ruoqi Shen, Kevin Tian
We give lower bounds on the performance of two of the most popular sampling methods in practice, the Metropolis-adjusted Langevin algorithm (MALA) and multi-step Hamiltonian Monte Carlo (HMC) with a leapfrog integrator, when applied to well-conditioned distributions. Our main result is a nearly-tight lower bound of $\widetildeΩ(κd)$ on the mixing time of MALA from an exponentially warm start, matching a line of algorithmic results up to logarithmic factors and answering an open question of Chewi et. al. We also show that a polynomial dependence on dimension is necessary for the relaxation time of HMC under any number of leapfrog steps, and bound the gains achievable by changing the step count. Our HMC analysis draws upon a novel connection between leapfrog integration and Chebyshev polynomials, which may be of independent interest.
LGFeb 19, 2021
Near-Optimal Randomized Exploration for Tabular Markov Decision ProcessesZhihan Xiong, Ruoqi Shen, Qiwen Cui et al.
We study algorithms using randomized value functions for exploration in reinforcement learning. This type of algorithms enjoys appealing empirical performance. We show that when we use 1) a single random seed in each episode, and 2) a Bernstein-type magnitude of noise, we obtain a worst-case $\widetilde{O}\left(H\sqrt{SAT}\right)$ regret bound for episodic time-inhomogeneous Markov Decision Process where $S$ is the size of state space, $A$ is the size of action space, $H$ is the planning horizon and $T$ is the number of interactions. This bound polynomially improves all existing bounds for algorithms based on randomized value functions, and for the first time, matches the $Ω\left(H\sqrt{SAT}\right)$ lower bound up to logarithmic factors. Our result highlights that randomized exploration can be near-optimal, which was previously achieved only by optimistic algorithms. To achieve the desired result, we develop 1) a new clipping operation to ensure both the probability of being optimistic and the probability of being pessimistic are lower bounded by a constant, and 2) a new recursive formula for the absolute value of estimation errors to analyze the regret.
DSOct 7, 2020
Structured Logconcave Sampling with a Restricted Gaussian OracleYin Tat Lee, Ruoqi Shen, Kevin Tian
We give algorithms for sampling several structured logconcave families to high accuracy. We further develop a reduction framework, inspired by proximal point methods in convex optimization, which bootstraps samplers for regularized densities to improve dependences on problem conditioning. A key ingredient in our framework is the notion of a "restricted Gaussian oracle" (RGO) for $g: \mathbb{R}^d \rightarrow \mathbb{R}$, which is a sampler for distributions whose negative log-likelihood sums a quadratic and $g$. By combining our reduction framework with our new samplers, we obtain the following bounds for sampling structured distributions to total variation distance $ε$. For composite densities $\exp(-f(x) - g(x))$, where $f$ has condition number $κ$ and convex (but possibly non-smooth) $g$ admits an RGO, we obtain a mixing time of $O(κd \log^3\frac{κd}ε)$, matching the state-of-the-art non-composite bound; no composite samplers with better mixing than general-purpose logconcave samplers were previously known. For logconcave finite sums $\exp(-F(x))$, where $F(x) = \frac{1}{n}\sum_{i \in [n]} f_i(x)$ has condition number $κ$, we give a sampler querying $\widetilde{O}(n + κ\max(d, \sqrt{nd}))$ gradient oracles to $\{f_i\}_{i \in [n]}$; no high-accuracy samplers with nontrivial gradient query complexity were previously known. For densities with condition number $κ$, we give an algorithm obtaining mixing time $O(κd \log^2\frac{κd}ε)$, improving the prior state-of-the-art by a logarithmic factor with a significantly simpler analysis; we also show a zeroth-order algorithm attains the same query complexity.
LGSep 21, 2020
Generalized Leverage Score Sampling for Neural NetworksJason D. Lee, Ruoqi Shen, Zhao Song et al.
Leverage score sampling is a powerful technique that originates from theoretical computer science, which can be used to speed up a large number of fundamental questions, e.g. linear regression, linear programming, semi-definite programming, cutting plane method, graph sparsification, maximum matching and max-flow. Recently, it has been shown that leverage score sampling helps to accelerate kernel methods [Avron, Kapralov, Musco, Musco, Velingker and Zandieh 17]. In this work, we generalize the results in [Avron, Kapralov, Musco, Musco, Velingker and Zandieh 17] to a broader class of kernels. We further bring the leverage score sampling into the field of deep learning theory. $\bullet$ We show the connection between the initialization for neural network training and approximating the neural tangent kernel with random features. $\bullet$ We prove the equivalence between regularized neural network and neural tangent kernel ridge regression under the initialization of both classical random Gaussian and leverage score sampling.
LGJun 10, 2020
Composite Logconcave Sampling with a Restricted Gaussian OracleRuoqi Shen, Kevin Tian, Yin Tat Lee
We consider sampling from composite densities on $\mathbb{R}^d$ of the form $dπ(x) \propto \exp(-f(x) - g(x))dx$ for well-conditioned $f$ and convex (but possibly non-smooth) $g$, a family generalizing restrictions to a convex set, through the abstraction of a restricted Gaussian oracle. For $f$ with condition number $κ$, our algorithm runs in $O \left(κ^2 d \log^2\tfrac{κd}ε\right)$ iterations, each querying a gradient of $f$ and a restricted Gaussian oracle, to achieve total variation distance $ε$. The restricted Gaussian oracle, which draws samples from a distribution whose negative log-likelihood sums a quadratic and $g$, has been previously studied and is a natural extension of the proximal oracle used in composite optimization. Our algorithm is conceptually simple and obtains stronger provable guarantees and greater generality than existing methods for composite sampling. We conduct experiments showing our algorithm vastly improves upon the hit-and-run algorithm for sampling the restriction of a (non-diagonal) Gaussian to the positive orthant.
LGJun 10, 2020
When is Particle Filtering Efficient for Planning in Partially Observed Linear Dynamical Systems?Simon S. Du, Wei Hu, Zhiyuan Li et al.
Particle filtering is a popular method for inferring latent states in stochastic dynamical systems, whose theoretical properties have been well studied in machine learning and statistics communities. In many control problems, e.g., partially observed linear dynamical systems (POLDS), oftentimes the inferred latent state is further used for planning at each step. This paper initiates a rigorous study on the efficiency of particle filtering for sequential planning, and gives the first particle complexity bounds. Though errors in past actions may affect the future, we are able to bound the number of particles needed so that the long-run reward of the policy based on particle filtering is close to that based on exact inference. In particular, we show that, in stable systems, polynomially many particles suffice. Key in our proof is a coupling of the ideal sequence based on the exact planning and the sequence generated by approximate planning based on particle filtering. We believe this technique can be useful in other sequential decision-making problems.
LGFeb 10, 2020
Logsmooth Gradient Concentration and Tighter Runtimes for Metropolized Hamiltonian Monte CarloYin Tat Lee, Ruoqi Shen, Kevin Tian
We show that the gradient norm $\|\nabla f(x)\|$ for $x \sim \exp(-f(x))$, where $f$ is strongly convex and smooth, concentrates tightly around its mean. This removes a barrier in the prior state-of-the-art analysis for the well-studied Metropolized Hamiltonian Monte Carlo (HMC) algorithm for sampling from a strongly logconcave distribution. We correspondingly demonstrate that Metropolized HMC mixes in $\tilde{O}(κd)$ iterations, improving upon the $\tilde{O}(κ^{1.5}\sqrt{d} + κd)$ runtime of (Dwivedi et. al. '18, Chen et. al. '19) by a factor $(κ/d)^{1/2}$ when the condition number $κ$ is large. Our mixing time analysis introduces several techniques which to our knowledge have not appeared in the literature and may be of independent interest, including restrictions to a nonconvex set with good conductance behavior, and a new reduction technique for boosting a constant-accuracy total variation guarantee under weak warmness assumptions. This is the first high-accuracy mixing time result for logconcave distributions using only first-order function information which achieves linear dependence on $κ$; we also give evidence that this dependence is likely to be necessary for standard Metropolized first-order methods.
LGSep 12, 2019
The Randomized Midpoint Method for Log-Concave SamplingRuoqi Shen, Yin Tat Lee
Sampling from log-concave distributions is a well researched problem that has many applications in statistics and machine learning. We study the distributions of the form $p^{*}\propto\exp(-f(x))$, where $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ has an $L$-Lipschitz gradient and is $m$-strongly convex. In our paper, we propose a Markov chain Monte Carlo (MCMC) algorithm based on the underdamped Langevin diffusion (ULD). It can achieve $ε\cdot D$ error (in 2-Wasserstein distance) in $\tilde{O}\left(κ^{7/6}/ε^{1/3}+κ/ε^{2/3}\right)$ steps, where $D\overset{\mathrm{def}}{=}\sqrt{\frac{d}{m}}$ is the effective diameter of the problem and $κ\overset{\mathrm{def}}{=}\frac{L}{m}$ is the condition number. Our algorithm performs significantly faster than the previously best known algorithm for solving this problem, which requires $\tilde{O}\left(κ^{1.5}/ε\right)$ steps. Moreover, our algorithm can be easily parallelized to require only $O(κ\log\frac{1}ε)$ parallel steps. To solve the sampling problem, we propose a new framework to discretize stochastic differential equations. We apply this framework to discretize and simulate ULD, which converges to the target distribution $p^{*}$. The framework can be used to solve not only the log-concave sampling problem, but any problem that involves simulating (stochastic) differential equations.