Chao Ma

LG
h-index17
20papers
1,569citations
Novelty47%
AI Score34

20 Papers

17.6LGDec 17, 2024
SWAN: SGD with Normalization and Whitening Enables Stateless LLM Training

Chao Ma, Wenbo Gong, Meyer Scetbon et al.

Adaptive optimizers such as Adam (Kingma & Ba, 2015) have been central to the success of large language models. However, they often require to maintain optimizer states throughout training, which can result in memory requirements several times greater than the model footprint. This overhead imposes constraints on scalability and computational efficiency. Stochastic Gradient Descent (SGD), in contrast, is a stateless optimizer, as it does not track state variables during training. Consequently, it achieves optimal memory efficiency. However, its capability in LLM training is limited (Zhao et al., 2024b). In this work, we show that pre-processing SGD in a stateless manner can achieve the same performance as the Adam optimizer for LLM training, while drastically reducing the memory cost. Specifically, we propose to pre-process the instantaneous stochastic gradients using normalization and whitening. We show that normalization stabilizes gradient distributions, and whitening counteracts the local curvature of the loss landscape. This results in SWAN (SGD with Whitening And Normalization), a stochastic optimizer that eliminates the need to store any optimizer states. Empirically, SWAN has the same memory footprint as SGD, achieving $\approx 50\%$ reduction on total end-to-end memory compared to Adam. In language modeling tasks, SWAN demonstrates comparable or even better performance than Adam: when pre-training the LLaMA model with 350M and 1.3B parameters, SWAN achieves a 2x speedup by reaching the same evaluation perplexity using half as many tokens.

16.9LGFeb 10, 2025
Gradient Multi-Normalization for Stateless and Scalable LLM Training

Meyer Scetbon, Chao Ma, Wenbo Gong et al.

Training large language models (LLMs) typically relies on adaptive optimizers like Adam (Kingma & Ba, 2015) which store additional state information to accelerate convergence but incur significant memory overhead. Recent efforts, such as SWAN (Ma et al., 2024) address this by eliminating the need for optimizer states while achieving performance comparable to Adam via a multi-step preprocessing procedure applied to instantaneous gradients. Motivated by the success of SWAN, we introduce a novel framework for designing stateless optimizers that normalizes stochastic gradients according to multiple norms. To achieve this, we propose a simple alternating scheme to enforce the normalization of gradients w.r.t these norms. We show that our procedure can produce, up to an arbitrary precision, a fixed-point of the problem, and that SWAN is a particular instance of our approach with carefully chosen norms, providing a deeper understanding of its design. However, SWAN's computationally expensive whitening/orthogonalization step limit its practicality for large LMs. Using our principled perspective, we develop of a more efficient, scalable, and practical stateless optimizer. Our algorithm relaxes the properties of SWAN, significantly reducing its computational cost while retaining its memory efficiency, making it applicable to training large-scale models. Experiments on pre-training LLaMA models with up to 1 billion parameters demonstrate a 3X speedup over Adam with significantly reduced memory requirements, outperforming other memory-efficient baselines.

30.6LGOct 12, 2020
Towards Theoretically Understanding Why SGD Generalizes Better Than ADAM in Deep Learning

Pan Zhou, Jiashi Feng, Chao Ma et al.

It is not clear yet why ADAM-alike adaptive gradient algorithms suffer from worse generalization performance than SGD despite their faster training speed. This work aims to provide understandings on this generalization gap by analyzing their local convergence behaviors. Specifically, we observe the heavy tails of gradient noise in these algorithms. This motivates us to analyze these algorithms through their Levy-driven stochastic differential equations (SDEs) because of the similar convergence behaviors of an algorithm and its SDE. Then we establish the escaping time of these SDEs from a local basin. The result shows that (1) the escaping time of both SGD and ADAM~depends on the Radon measure of the basin positively and the heaviness of gradient noise negatively; (2) for the same basin, SGD enjoys smaller escaping time than ADAM, mainly because (a) the geometry adaptation in ADAM~via adaptively scaling each gradient coordinate well diminishes the anisotropic structure in gradient noise and results in larger Radon measure of a basin; (b) the exponential gradient average in ADAM~smooths its gradient and leads to lighter gradient noise tails than SGD. So SGD is more locally unstable than ADAM~at sharp minima defined as the minima whose local basins have small Radon measure, and can better escape from them to flatter ones with larger Radon measure. As flat minima here which often refer to the minima at flat or asymmetric basins/valleys often generalize better than sharp ones , our result explains the better generalization performance of SGD over ADAM. Finally, experimental results confirm our heavy-tailed gradient noise assumption and theoretical affirmation.

26.9LGSep 22, 2020
Towards a Mathematical Understanding of Neural Network-Based Machine Learning: what we know and what we don't

Weinan E, Chao Ma, Stephan Wojtowytsch et al.

The purpose of this article is to review the achievements made in the last few years towards the understanding of the reasons behind the success and subtleties of neural network-based machine learning. In the tradition of good old applied mathematics, we will not only give attention to rigorous mathematical results, but also the insight we have gained from careful numerical experiments as well as the analysis of simplified models. Along the way, we also list the open problems which we believe to be the most important topics for further study. This is not a complete overview over this quickly moving field, but we hope to provide a perspective which may be helpful especially to new researchers in the area.

12.8LGSep 14, 2020
Complexity Measures for Neural Networks with General Activation Functions Using Path-based Norms

Zhong Li, Chao Ma, Lei Wu

A simple approach is proposed to obtain complexity controls for neural networks with general activation functions. The approach is motivated by approximating the general activation functions with one-dimensional ReLU networks, which reduces the problem to the complexity controls of ReLU networks. Specifically, we consider two-layer networks and deep residual networks, for which path-based norms are derived to control complexities. We also provide preliminary analyses of the function spaces induced by these norms and a priori estimates of the corresponding regularized estimators.

14.7LGSep 14, 2020
A Qualitative Study of the Dynamic Behavior for Adaptive Gradient Algorithms

Chao Ma, Lei Wu, Weinan E

The dynamic behavior of RMSprop and Adam algorithms is studied through a combination of careful numerical experiments and theoretical explanations. Three types of qualitative features are observed in the training loss curve: fast initial convergence, oscillations, and large spikes in the late phase. The sign gradient descent (signGD) flow, which is the limit of Adam when taking the learning rate to 0 while keeping the momentum parameters fixed, is used to explain the fast initial convergence. For the late phase of Adam, three different types of qualitative patterns are observed depending on the choice of the hyper-parameters: oscillations, spikes, and divergence. In particular, Adam converges much smoother and even faster when the values of the two momentum factors are close to each other. This observation is particularly important for scientific computing tasks, for which the training process usually proceeds into the high precision regime.

9.6LGAug 13, 2020
The Slow Deterioration of the Generalization Error of the Random Feature Model

Chao Ma, Lei Wu, Weinan E

The random feature model exhibits a kind of resonance behavior when the number of parameters is close to the training sample size. This behavior is characterized by the appearance of large generalization gap, and is due to the occurrence of very small eigenvalues for the associated Gram matrix. In this paper, we examine the dynamic behavior of the gradient descent algorithm in this regime. We show, both theoretically and experimentally, that there is a dynamic self-correction mechanism at work: The larger the eventual generalization gap, the slower it develops, both because of the small eigenvalues. This gives us ample time to stop the training process and obtain solutions with good generalization property.

10.4MLDec 15, 2019
The Generalization Error of the Minimum-norm Solutions for Over-parameterized Neural Networks

Weinan E, Chao Ma, Lei Wu

We study the generalization properties of minimum-norm solutions for three over-parametrized machine learning models including the random feature model, the two-layer neural network model and the residual network model. We proved that for all three models, the generalization error for the minimum-norm solution is comparable to the Monte Carlo rate, up to some logarithmic terms, as long as the models are sufficiently over-parametrized.

10.3LGNov 2, 2019
Global Convergence of Gradient Descent for Deep Linear Residual Networks

Lei Wu, Qingcan Wang, Chao Ma

We analyze the global convergence of gradient descent for deep linear residual networks by proposing a new initialization: zero-asymmetric (ZAS) initialization. It is motivated by avoiding stable manifolds of saddle points. We prove that under the ZAS initialization, for an arbitrary target matrix, gradient descent converges to an $\varepsilon$-optimal point in $O(L^3 \log(1/\varepsilon))$ iterations, which scales polynomially with the network depth $L$. Our result and the $\exp(Ω(L))$ convergence time for the standard initialization (Xavier or near-identity) [Shamir, 2018] together demonstrate the importance of the residual structure and the initialization in the optimization for deep linear neural networks, especially when $L$ is large.

4.7CVJul 29, 2019
End-to-End Learning Deep CRF models for Multi-Object Tracking

Jun Xiang, Ma Chao, Guohan Xu et al.

Existing deep multi-object tracking (MOT) approaches first learn a deep representation to describe target objects and then associate detection results by optimizing a linear assignment problem. Despite demonstrated successes, it is challenging to discriminate target objects under mutual occlusion or to reduce identity switches in crowded scenes. In this paper, we propose learning deep conditional random field (CRF) networks, aiming to model the assignment costs as unary potentials and the long-term dependencies among detection results as pairwise potentials. Specifically, we use a bidirectional long short-term memory (LSTM) network to encode the long-term dependencies. We pose the CRF inference as a recurrent neural network learning process using the standard gradient descent algorithm, where unary and pairwise potentials are jointly optimized in an end-to-end manner. Extensive experimental results on the challenging MOT datasets including MOT-2015 and MOT-2016, demonstrate that our approach achieves the state of the art performances in comparison with published works on both benchmarks.

30.3LGJun 18, 2019
The Barron Space and the Flow-induced Function Spaces for Neural Network Models

Weinan E, Chao Ma, Lei Wu

One of the key issues in the analysis of machine learning models is to identify the appropriate function space and norm for the model. This is the set of functions endowed with a quantity which can control the approximation and estimation errors by a particular machine learning model. In this paper, we address this issue for two representative neural network models: the two-layer networks and the residual neural networks. We define the Barron space and show that it is the right space for two-layer neural network models in the sense that optimal direct and inverse approximation theorems hold for functions in the Barron space. For residual neural network models, we construct the so-called flow-induced function space, and prove direct and inverse approximation theorems for this space. In addition, we show that the Rademacher complexity for bounded sets under these norms has the optimal upper bounds.

14.6LGApr 10, 2019
Analysis of the Gradient Descent Algorithm for a Deep Neural Network Model with Skip-connections

Weinan E, Chao Ma, Qingcan Wang et al.

The behavior of the gradient descent (GD) algorithm is analyzed for a deep neural network model with skip-connections. It is proved that in the over-parametrized regime, for a suitable initialization, with high probability GD can find a global minimum exponentially fast. Generalization error estimates along the GD path are also established. As a consequence, it is shown that when the target function is in the reproducing kernel Hilbert space (RKHS) with a kernel defined by the initialization, there exist generalizable early-stopping solutions along the GD path. In addition, it is also shown that the GD path is uniformly close to the functions given by the related random feature model. Consequently, in this "implicit regularization" setting, the deep neural network model deteriorates to a random feature model. Our results hold for neural networks of any width larger than the input dimension.

23.9LGApr 8, 2019
A Comparative Analysis of the Optimization and Generalization Property of Two-layer Neural Network and Random Feature Models Under Gradient Descent Dynamics

Weinan E, Chao Ma, Lei Wu

A fairly comprehensive analysis is presented for the gradient descent dynamics for training two-layer neural network models in the situation when the parameters in both layers are updated. General initialization schemes as well as general regimes for the network width and training data size are considered. In the over-parametrized regime, it is shown that gradient descent dynamics can achieve zero training loss exponentially fast regardless of the quality of the labels. In addition, it is proved that throughout the training process the functions represented by the neural network model are uniformly close to that of a kernel method. For general values of the network width and training data size, sharp estimates of the generalization error is established for target functions in the appropriate reproducing kernel Hilbert space.

16.8LGMar 6, 2019
A Priori Estimates of the Population Risk for Residual Networks

Weinan E, Chao Ma, Qingcan Wang

Optimal a priori estimates are derived for the population risk, also known as the generalization error, of a regularized residual network model. An important part of the regularized model is the usage of a new path norm, called the weighted path norm, as the regularization term. The weighted path norm treats the skip connections and the nonlinearities differently so that paths with more nonlinearities are regularized by larger weights. The error estimates are a priori in the sense that the estimates depend only on the target function, not on the parameters obtained in the training process. The estimates are optimal, in a high dimensional setting, in the sense that both the bound for the approximation and estimation errors are comparable to the Monte Carlo error rates. A crucial step in the proof is to establish an optimal bound for the Rademacher complexity of the residual networks. Comparisons are made with existing norm-based generalization error bounds.

27.2MLOct 15, 2018
A Priori Estimates of the Population Risk for Two-layer Neural Networks

Weinan E, Chao Ma, Lei Wu

New estimates for the population risk are established for two-layer neural networks. These estimates are nearly optimal in the sense that the error rates scale in the same way as the Monte Carlo error rates. They are equally effective in the over-parametrized regime when the network size is much larger than the size of the dataset. These new estimates are a priori in nature in the sense that the bounds depend only on some norms of the underlying functions to be fitted, not the parameters in the model, in contrast with most existing results which are a posteriori in nature. Using these a priori estimates, we provide a perspective for understanding why two-layer neural networks perform better than the related kernel methods.

7.6IVOct 12, 2018Code
Heterogeneous multireference alignment for images with application to 2-D classification in single particle reconstruction

Chao Ma, Tamir Bendory, Nicolas Boumal et al.

Motivated by the task of 2-D classification in single particle reconstruction by cryo-electron microscopy (cryo-EM), we consider the problem of heterogeneous multireference alignment of images. In this problem, the goal is to estimate a (typically small) set of target images from a (typically large) collection of observations. Each observation is a rotated, noisy version of one of the target images. For each individual observation, neither the rotation nor which target image has been rotated are known. As the noise level in cryo-EM data is high, clustering the observations and estimating individual rotations is challenging. We propose a framework to estimate the target images directly from the observations, completely bypassing the need to cluster or register the images. The framework consists of two steps. First, we estimate rotation-invariant features of the images, such as the bispectrum. These features can be estimated to any desired accuracy, at any noise level, provided sufficiently many observations are collected. Then, we estimate the images from the invariant features. Numerical experiments on synthetic cryo-EM datasets demonstrate the effectiveness of the method. Ultimately, we outline future developments required to apply this method to experimental data.

18.8IROct 8, 2018
Person-Job Fit: Adapting the Right Talent for the Right Job with Joint Representation Learning

Chen Zhu, Hengshu Zhu, Hui Xiong et al.

Person-Job Fit is the process of matching the right talent for the right job by identifying talent competencies that are required for the job. While many qualitative efforts have been made in related fields, it still lacks of quantitative ways of measuring talent competencies as well as the job's talent requirements. To this end, in this paper, we propose a novel end-to-end data-driven model based on Convolutional Neural Network (CNN), namely Person-Job Fit Neural Network (PJFNN), for matching a talent qualification to the requirements of a job. To be specific, PJFNN is a bipartite neural network which can effectively learn the joint representation of Person-Job fitness from historical job applications. In particular, due to the design of a hierarchical representation structure, PJFNN can not only estimate whether a candidate fits a job, but also identify which specific requirement items in the job posting are satisfied by the candidate by measuring the distances between corresponding latent representations. Finally, the extensive experiments on a large-scale real-world dataset clearly validate the performance of PJFNN in terms of Person-Job Fit prediction. Also, we provide effective data visualization to show some job and talent benchmark insights obtained by PJFNN.

22.9LGSep 28, 2018Code
EDDI: Efficient Dynamic Discovery of High-Value Information with Partial VAE

Chao Ma, Sebastian Tschiatschek, Konstantina Palla et al.

Many real-life decision-making situations allow further relevant information to be acquired at a specific cost, for example, in assessing the health status of a patient we may decide to take additional measurements such as diagnostic tests or imaging scans before making a final assessment. Acquiring more relevant information enables better decision making, but may be costly. How can we trade off the desire to make good decisions by acquiring further information with the cost of performing that acquisition? To this end, we propose a principled framework, named EDDI (Efficient Dynamic Discovery of high-value Information), based on the theory of Bayesian experimental design. In EDDI, we propose a novel partial variational autoencoder (Partial VAE) to predict missing data entries problematically given any subset of the observed ones, and combine it with an acquisition function that maximizes expected information gain on a set of target variables. We show cost reduction at the same decision quality and improved decision quality at the same cost in multiple machine learning benchmarks and two real-world health-care applications.

15.7LGAug 10, 2018
Model Reduction with Memory and the Machine Learning of Dynamical Systems

Chao Ma, Jianchun Wang, Weinan E

The well-known Mori-Zwanzig theory tells us that model reduction leads to memory effect. For a long time, modeling the memory effect accurately and efficiently has been an important but nearly impossible task in developing a good reduced model. In this work, we explore a natural analogy between recurrent neural networks and the Mori-Zwanzig formalism to establish a systematic approach for developing reduced models with memory. Two training models-a direct training model and a dynamically coupled training model-are proposed and compared. We apply these methods to the Kuramoto-Sivashinsky equation and the Navier-Stokes equation. Numerical experiments show that the proposed method can produce reduced model with good performance on both short-term prediction and long-term statistical properties.

23.0MLJun 6, 2018Code
Variational Implicit Processes

Chao Ma, Yingzhen Li, José Miguel Hernández-Lobato

We introduce the implicit processes (IPs), a stochastic process that places implicitly defined multivariate distributions over any finite collections of random variables. IPs are therefore highly flexible implicit priors over functions, with examples including data simulators, Bayesian neural networks and non-linear transformations of stochastic processes. A novel and efficient approximate inference algorithm for IPs, namely the variational implicit processes (VIPs), is derived using generalised wake-sleep updates. This method returns simple update equations and allows scalable hyper-parameter learning with stochastic optimization. Experiments show that VIPs return better uncertainty estimates and lower errors over existing inference methods for challenging models such as Bayesian neural networks, and Gaussian processes.