LGMay 7
Revealing Modular Gradient Noise Imbalance in LLMs: Calibrating Adam via Signal-to-Noise RatioZiqing Wen, Zhouyang Liu, Jiahuan Wang et al.
The impressive performance of large language models (LLMs) arises from their massive scale and heterogeneous module composition. However, this structural heterogeneity introduces additional optimization challenges. While adaptive optimizers such as Adam(W) provide per-parameter adaptivity, they do not explicitly account for module-level gradient heterogeneity, resulting in slower convergence, suboptimal performance, or training instability. Existing approaches typically rely on manually tuned module-specific learning rates or specific optimization strategies, which are computationally costly and difficult to generalize across tasks or models. To establish a more principled approach, we first analyze the noise-damping behavior of Adam in high-noise modules and introduce \textbf{Module-wise Learning Rate Scaling via SNR (MoLS)}. MoLS estimates module-level SNRs to scale Adam updates, allowing automated module-wise learning rate allocation without manual tuning. Empirical results through multiple LLM training benchmarks demonstrate that MoLS improves convergence speed and generalization, achieving performance comparable to carefully tuned module-specific learning rates, while remaining compatible with memory-efficient training algorithms.
LGSep 3, 2024
Federated Prediction-Powered Inference from Decentralized DataPing Luo, Xiaoge Deng, Ziqing Wen et al.
In various domains, the increasing application of machine learning allows researchers to access inexpensive predictive data, which can be utilized as auxiliary data for statistical inference. Although such data are often unreliable compared to gold-standard datasets, Prediction-Powered Inference (PPI) has been proposed to ensure statistical validity despite the unreliability. However, the challenge of `data silos' arises when the private gold-standard datasets are non-shareable for model training, leading to less accurate predictive models and invalid inferences. In this paper, we introduces the Federated Prediction-Powered Inference (Fed-PPI) framework, which addresses this challenge by enabling decentralized experimental data to contribute to statistically valid conclusions without sharing private information. The Fed-PPI framework involves training local models on private data, aggregating them through Federated Learning (FL), and deriving confidence intervals using PPI computation. The proposed framework is evaluated through experiments, demonstrating its effectiveness in producing valid confidence intervals.
LGJan 7
Local Gradient Regulation Stabilizes Federated Learning under Client HeterogeneityPing Luo, Jiahuan Wang, Ziqing Wen et al.
Federated learning (FL) enables collaborative model training across distributed clients without sharing raw data, yet its stability is fundamentally challenged by statistical heterogeneity in realistic deployments. Here, we show that client heterogeneity destabilizes FL primarily by distorting local gradient dynamics during client-side optimization, causing systematic drift that accumulates across communication rounds and impedes global convergence. This observation highlights local gradients as a key regulatory lever for stabilizing heterogeneous FL systems. Building on this insight, we develop a general client-side perspective that regulates local gradient contributions without incurring additional communication overhead. Inspired by swarm intelligence, we instantiate this perspective through Exploratory--Convergent Gradient Re-aggregation (ECGR), which balances well-aligned and misaligned gradient components to preserve informative updates while suppressing destabilizing effects. Theoretical analysis and extensive experiments, including evaluations on the LC25000 medical imaging dataset, demonstrate that regulating local gradient dynamics consistently stabilizes federated learning across state-of-the-art methods under heterogeneous data distributions.
LGDec 8, 2025
FOAM: Blocked State Folding for Memory-Efficient LLM TrainingZiqing Wen, Jiahuan Wang, Ping Luo et al.
Large language models (LLMs) have demonstrated remarkable performance due to their large parameter counts and extensive training data. However, their scale leads to significant memory bottlenecks during training, especially when using memory-intensive optimizers like Adam. Existing memory-efficient approaches often rely on techniques such as singular value decomposition (SVD), projections, or weight freezing, which can introduce substantial computational overhead, require additional memory for projections, or degrade model performance. In this paper, we propose Folded Optimizer with Approximate Moment (FOAM), a method that compresses optimizer states by computing block-wise gradient means and incorporates a residual correction to recover lost information. Theoretically, FOAM achieves convergence rates equivalent to vanilla Adam under standard non-convex optimization settings. Empirically, FOAM reduces total training memory by approximately 50\%, eliminates up to 90\% of optimizer state memory overhead, and accelerates convergence. Furthermore, FOAM is compatible with other memory-efficient optimizers, delivering performance and throughput that match or surpass both full-rank and existing memory-efficient baselines.
LGMay 11
Unveiling High-Probability Generalization in Decentralized SGDJiahuan Wang, Ping Luo, Ziqing Wen et al.
Decentralized stochastic gradient descent (D-SGD) is an efficient method for large-scale distributed learning. Existing generalization studies mainly address expected results, achieving rates limited to $\mathcal{O}\left(\frac{1}{δ\sqrt{mn}}\right)$, where $δ$ is the confidence parameter, $m$ the number of workers, and $n$ the sample size. When $m=1$, D-SGD reduces to traditional SGD, whose optimal high-probability generalization bound is $\mathcal{O}\left(\frac{1}{\sqrt{n}}\log (1/δ)\right)$. This discrepancy reveals a gap between high-probability guarantees for SGD and those for D-SGD. To close this, we develop a high-probability learning theory for D-SGD, aiming for the optimal $\mathcal{O}\left(\frac{1}{\sqrt{mn}}\log (1/δ)\right)$ rate. We refine bounds for D-SGD using pointwise uniform stability in distributed learning-a weaker notion than uniform stability-and analyze them across convex, strongly convex, and non-convex settings. We also provide high-probability results for gradient-based measures in non-convex cases where only local minima exist, and derive optimization error and excess risk bounds. Finally, accounting for communication overhead, we analyze generalization bounds for local models within time-varying frameworks.
LGMay 3
Stability and Generalization for Decentralized Markov SGDJiahuan Wang, Ziqing Wen, Ping Luo et al.
Stochastic gradient methods are central to large-scale learning, yet their generalization theory typically relies on independent sampling assumptions. In many practical applications, data are generated by Markov chains and learning is performed in a decentralized manner, which introduces significant analytical challenges. In this work, we investigate the stability and generalization of decentralized stochastic gradient descent (SGD) and stochastic gradient descent ascent (SGDA) under Markov chain sampling. Leveraging a stability-based framework, we characterize how Markovian dependence and decentralized communication jointly influence generalization behavior. Our analysis captures the effects of network topology, Markov chain mixing properties, and primal-dual dynamics. We establish non-asymptotic generalization bounds for both algorithms, extending existing results on Markov stochastic gradient methods to decentralized and minimax settings.
LGMay 22, 2024
Score-based Generative Models with Adaptive MomentumZiqing Wen, Xiaoge Deng, Ping Luo et al.
Score-based generative models have demonstrated significant practical success in data-generating tasks. The models establish a diffusion process that perturbs the ground truth data to Gaussian noise and then learn the reverse process to transform noise into data. However, existing denoising methods such as Langevin dynamic and numerical stochastic differential equation solvers enjoy randomness but generate data slowly with a large number of score function evaluations, and the ordinary differential equation solvers enjoy faster sampling speed but no randomness may influence the sample quality. To this end, motivated by the Stochastic Gradient Descent (SGD) optimization methods and the high connection between the model sampling process with the SGD, we propose adaptive momentum sampling to accelerate the transforming process without introducing additional hyperparameters. Theoretically, we proved our method promises convergence under given conditions. In addition, we empirically show that our sampler can produce more faithful images/graphs in small sampling steps with 2 to 5 times speed up and obtain competitive scores compared to the baselines on image and graph generation tasks.
LGJan 13, 2025
Wavelet Meets Adam: Compressing Gradients for Memory-Efficient TrainingZiqing Wen, Ping Luo, Jiahuan Wang et al.
Large language models (LLMs) have shown impressive performance across a range of natural language processing tasks. However, their vast number of parameters introduces significant memory challenges during training, particularly when using memory-intensive optimizers like Adam. Existing memory-efficient algorithms often rely on techniques such as singular value decomposition projection or weight freezing. While these approaches help alleviate memory constraints, they generally produce suboptimal results compared to full-rank updates. In this paper, we investigate the memory-efficient method beyond low-rank training, proposing a novel solution called Gradient Wavelet Transform (GWT), which applies wavelet transforms to gradients in order to significantly reduce the memory requirements for maintaining optimizer states. We demonstrate that GWT can be seamlessly integrated with memory-intensive optimizers, enabling efficient training without sacrificing performance. Through extensive experiments on both pre-training and fine-tuning tasks, we show that GWT achieves state-of-the-art performance compared with advanced memory-efficient optimizers and full-rank approaches in terms of both memory usage and training performance.
LGMar 25, 2024
Accelerating Federated Learning by Selecting Beneficial Herd of Local GradientsPing Luo, Xiaoge Deng, Ziqing Wen et al.
Federated Learning (FL) is a distributed machine learning framework in communication network systems. However, the systems' Non-Independent and Identically Distributed (Non-IID) data negatively affect the convergence efficiency of the global model, since only a subset of these data samples are beneficial for model convergence. In pursuit of this subset, a reliable approach involves determining a measure of validity to rank the samples within the dataset. In this paper, We propose the BHerd strategy which selects a beneficial herd of local gradients to accelerate the convergence of the FL model. Specifically, we map the distribution of the local dataset to the local gradients and use the Herding strategy to obtain a permutation of the set of gradients, where the more advanced gradients in the permutation are closer to the average of the set of gradients. These top portion of the gradients will be selected and sent to the server for global aggregation. We conduct experiments on different datasets, models and scenarios by building a prototype system, and experimental results demonstrate that our BHerd strategy is effective in selecting beneficial local gradients to mitigate the effects brought by the Non-IID dataset, thus accelerating model convergence.