OCAug 21, 2023
A Homogenization Approach for Gradient-Dominated Stochastic OptimizationJiyuan Tan, Chenyu Xue, Chuwen Zhang et al.
Gradient dominance property is a condition weaker than strong convexity, yet sufficiently ensures global convergence even in non-convex optimization. This property finds wide applications in machine learning, reinforcement learning (RL), and operations management. In this paper, we propose the stochastic homogeneous second-order descent method (SHSODM) for stochastic functions enjoying gradient dominance property based on a recently proposed homogenization approach. Theoretically, we provide its sample complexity analysis, and further present an enhanced result by incorporating variance reduction techniques. Our findings show that SHSODM matches the best-known sample complexity achieved by other second-order methods for gradient-dominated stochastic optimization but without cubic regularization. Empirically, since the homogenization approach only relies on solving extremal eigenvector problem at each iteration instead of Newton-type system, our methods gain the advantage of cheaper computational cost and robustness in ill-conditioned problems. Numerical experiments on several RL tasks demonstrate the better performance of SHSODM compared to other off-the-shelf methods.
LGAug 21, 2023Code
GBM-based Bregman Proximal Algorithms for Constrained LearningZhenwei Lin, Qi Deng
As the complexity of learning tasks surges, modern machine learning encounters a new constrained learning paradigm characterized by more intricate and data-driven function constraints. Prominent applications include Neyman-Pearson classification (NPC) and fairness classification, which entail specific risk constraints that render standard projection-based training algorithms unsuitable. Gradient boosting machines (GBMs) are among the most popular algorithms for supervised learning; however, they are generally limited to unconstrained settings. In this paper, we adapt the GBM for constrained learning tasks within the framework of Bregman proximal algorithms. We introduce a new Bregman primal-dual method with a global optimality guarantee when the learning objective and constraint functions are convex. In cases of nonconvex functions, we demonstrate how our algorithm remains effective under a Bregman proximal point framework. Distinct from existing constrained learning algorithms, ours possess a unique advantage in their ability to seamlessly integrate with publicly available GBM implementations such as XGBoost (Chen and Guestrin, 2016) and LightGBM (Ke et al., 2017), exclusively relying on their public interfaces. We provide substantial experimental evidence to showcase the effectiveness of the Bregman algorithm framework. While our primary focus is on NPC and fairness ML, our framework holds significant potential for a broader range of constrained learning applications. The source code is currently freely available at https://github.com/zhenweilin/ConstrainedGBM}{https://github.com/zhenweilin/ConstrainedGBM.
OCJan 28, 2023
Stochastic Dimension-reduced Second-order Methods for Policy OptimizationJinsong Liu, Chenghan Xie, Qi Deng et al.
In this paper, we propose several new stochastic second-order algorithms for policy optimization that only require gradient and Hessian-vector product in each iteration, making them computationally efficient and comparable to policy gradient methods. Specifically, we propose a dimension-reduced second-order method (DR-SOPO) which repeatedly solves a projected two-dimensional trust region subproblem. We show that DR-SOPO obtains an $\mathcal{O}(ε^{-3.5})$ complexity for reaching approximate first-order stationary condition and certain subspace second-order stationary condition. In addition, we present an enhanced algorithm (DVR-SOPO) which further improves the complexity to $\mathcal{O}(ε^{-3})$ based on the variance reduction technique. Preliminary experiments show that our proposed algorithms perform favorably compared with stochastic and variance-reduced policy gradient methods.
OCApr 10, 2023
First-order methods for Stochastic Variational Inequality problems with Function ConstraintsDigvijay Boob, Qi Deng, Mohammad Khalafi
The monotone Variational Inequality (VI) is a general model with important applications in various engineering and scientific domains. In numerous instances, the VI problems are accompanied by function constraints that can be data-driven, making the usual projection operator challenging to compute. This paper presents novel first-order methods for the function-constrained Variational Inequality (FCVI) problem in smooth or nonsmooth settings with possibly stochastic operators and constraints. We introduce the AdOpEx method, which employs an operator extrapolation on the KKT operator of the FCVI in a smooth deterministic setting. Since this operator is not uniformly Lipschitz continuous in the Lagrange multipliers, we employ an adaptive two-timescale algorithm leading to bounded multipliers and achieving the optimal $O(1/T)$ convergence rate. For the nonsmooth and stochastic VIs, we introduce design changes to the AdOpEx method and propose a novel P-OpEx method that takes partial extrapolation. It converges at the rate of $O(1/\sqrt{T})$ when both the operator and constraints are stochastic or nonsmooth. This method has suboptimal dependence on the noise and Lipschitz constants of function constraints. We propose a constraint extrapolation approach leading to the OpConEx method that improves this dependence by an order of magnitude. All our algorithms easily extend to saddle point problems with function constraints that couple the primal and dual variables while maintaining the same complexity results. To the best of our knowledge, all our complexity results are new in the literature
47.8CVMar 15
ZOTTA: Test-Time Adaptation with Gradient-Free Zeroth-Order OptimizationRonghao Zhang, Shuaicheng Niu, Qi Deng et al.
Test-time adaptation (TTA) aims to improve model robustness under distribution shifts by adapting to unlabeled test data, but most existing methods rely on backpropagation (BP), which is computationally costly and incompatible with non-differentiable models such as quantized models, limiting practical deployment on numerous edge devices. Recent BP-free approaches alleviate overhead but remain either architecture-specific or limited in optimization capacity to handle high-dimensional models. We propose ZOTTA, a fully BP-free TTA framework that performs efficient adaptation using only forward passes via Zeroth-Order Optimization (ZOO). While ZOO is theoretically appealing, naive application leads to slow convergence under high-dimensional parameter spaces and unstable optimization due to the lack of labels. ZOTTA overcomes these challenges through 1) Distribution-Robust Layer Selection, which automatically identifies and freezes layers that already extract distribution-invariant features, updating only domain-sensitive layers to reduce the optimization dimensionality and accelerate convergence; 2) Spatial Feature Aggregation Alignment, which stabilizes ZOO by aligning globally aggregated spatial features between source and target to reduce gradient variance. Together, these components enable architecture-agnostic and stable BP-free adaptation. Extensive experiments on ImageNet-C/R/Sketch/A show that ZOTTA outperforms or matches BP-based methods, e.g., it reduces memory usage by 84% and improves accuracy by 3.9% over SAR on ImageNet-C.
OCDec 21, 2022
Faster Accelerated First-order Methods for Convex Optimization with Strongly Convex Function ConstraintsZhenwei Lin, Qi Deng
In this paper, we introduce faster accelerated primal-dual algorithms for minimizing a convex function subject to strongly convex function constraints. Prior to our work, the best complexity bound was $\mathcal{O}(1/{\varepsilon})$, regardless of the strong convexity of the constraint function. It is unclear whether the strong convexity assumption can enable even better convergence results. To address this issue, we have developed novel techniques to progressively estimate the strong convexity of the Lagrangian function. Our approach, for the first time, effectively leverages the constraint strong convexity, obtaining an improved complexity of $\mathcal{O}(1/\sqrt{\varepsilon})$. This rate matches the complexity lower bound for strongly-convex-concave saddle point optimization and is therefore order-optimal. We show the superior performance of our methods in sparsity-inducing constrained optimization, notably Google's personalized PageRank problem. Furthermore, we show that a restarted version of the proposed methods can effectively identify the optimal solution's sparsity pattern within a finite number of steps, a result that appears to have independent significance.
CVJun 14, 2025Code
Exploring Audio Cues for Enhanced Test-Time Video Model AdaptationRunhao Zeng, Qi Deng, Ronghao Zhang et al.
Test-time adaptation (TTA) aims to boost the generalization capability of a trained model by conducting self-/unsupervised learning during the testing phase. While most existing TTA methods for video primarily utilize visual supervisory signals, they often overlook the potential contribution of inherent audio data. To address this gap, we propose a novel approach that incorporates audio information into video TTA. Our method capitalizes on the rich semantic content of audio to generate audio-assisted pseudo-labels, a new concept in the context of video TTA. Specifically, we propose an audio-to-video label mapping method by first employing pre-trained audio models to classify audio signals extracted from videos and then mapping the audio-based predictions to video label spaces through large language models, thereby establishing a connection between the audio categories and video labels. To effectively leverage the generated pseudo-labels, we present a flexible adaptation cycle that determines the optimal number of adaptation iterations for each sample, based on changes in loss and consistency across different views. This enables a customized adaptation process for each sample. Experimental results on two widely used datasets (UCF101-C and Kinetics-Sounds-C), as well as on two newly constructed audio-video TTA datasets (AVE-C and AVMIT-C) with various corruption types, demonstrate the superiority of our approach. Our method consistently improves adaptation performance across different video classification models and represents a significant step forward in integrating audio information into video TTA. Code: https://github.com/keikeiqi/Audio-Assisted-TTA.
OCJun 3, 2025Code
BenLOC: A Benchmark for Learning to Configure MIP OptimizersHongpei Li, Ziyan He, Yufei Wang et al.
The automatic configuration of Mixed-Integer Programming (MIP) optimizers has become increasingly critical as the large number of configurations can significantly affect solver performance. Yet the lack of standardized evaluation frameworks has led to data leakage and over-optimistic claims, as prior studies often rely on homogeneous datasets and inconsistent experimental setups. To promote a fair evaluation process, we present BenLOC, a comprehensive benchmark and open-source toolkit, which not only offers an end-to-end pipeline for learning instance-wise MIP optimizer configurations, but also standardizes dataset selection, train-test splits, feature engineering and baseline choice for unbiased and comprehensive evaluations. Leveraging this framework, we conduct an empirical analysis on five well-established MIP datasets and compare classical machine learning models with handcrafted features against state-of-the-art deep-learning techniques. The results demonstrate the importance of datasets, features and baseline criteria proposed by BenLOC and the effectiveness of BenLOC in providing unbiased and comprehensive evaluations.
CEDec 12, 2025
Integrated Prediction and Multi-period Portfolio OptimizationYuxuan Linghu, Zhiyuan Liu, Qi Deng
Multi-period portfolio optimization is important for real portfolio management, as it accounts for transaction costs, path-dependent risks, and the intertemporal structure of trading decisions that single-period models cannot capture. Classical methods usually follow a two-stage framework: machine learning algorithms are employed to produce forecasts that closely fit the realized returns, and the predicted values are then used in a downstream portfolio optimization problem to determine the asset weights. This separation leads to a fundamental misalignment between predictions and decision outcomes, while also ignoring the impact of transaction costs. To bridge this gap, recent studies have proposed the idea of end-to-end learning, integrating the two stages into a single pipeline. This paper introduces IPMO (Integrated Prediction and Multi-period Portfolio Optimization), a model for multi-period mean-variance portfolio optimization with turnover penalties. The predictor generates multi-period return forecasts that parameterize a differentiable convex optimization layer, which in turn drives learning via portfolio performance. For scalability, we introduce a mirror-descent fixed-point (MDFP) differentiation scheme that avoids factorizing the Karush-Kuhn-Tucker (KKT) systems, which thus yields stable implicit gradients and nearly scale-insensitive runtime as the decision horizon grows. In experiments with real market data and two representative time-series prediction models, the IPMO method consistently outperforms the two-stage benchmarks in risk-adjusted performance net of transaction costs and achieves more coherent allocation paths. Our results show that integrating machine learning prediction with optimization in the multi-period setting improves financial outcomes and remains computationally tractable.
LGDec 22, 2024
Learning to Generate Gradients for Test-Time Adaptation via Test-Time Training LayersQi Deng, Shuaicheng Niu, Ronghao Zhang et al.
Test-time adaptation (TTA) aims to fine-tune a trained model online using unlabeled testing data to adapt to new environments or out-of-distribution data, demonstrating broad application potential in real-world scenarios. However, in this optimization process, unsupervised learning objectives like entropy minimization frequently encounter noisy learning signals. These signals produce unreliable gradients, which hinder the model ability to converge to an optimal solution quickly and introduce significant instability into the optimization process. In this paper, we seek to resolve these issues from the perspective of optimizer design. Unlike prior TTA using manually designed optimizers like SGD, we employ a learning-to-optimize approach to automatically learn an optimizer, called Meta Gradient Generator (MGG). Specifically, we aim for MGG to effectively utilize historical gradient information during the online optimization process to optimize the current model. To this end, in MGG, we design a lightweight and efficient sequence modeling layer -- gradient memory layer. It exploits a self-supervised reconstruction loss to compress historical gradient information into network parameters, thereby enabling better memorization ability over a long-term adaptation process. We only need a small number of unlabeled samples to pre-train MGG, and then the trained MGG can be deployed to process unseen samples. Promising results on ImageNet-C, R, Sketch, and A indicate that our method surpasses current state-of-the-art methods with fewer updates, less data, and significantly shorter adaptation iterations. Compared with a previous SOTA method SAR, we achieve 7.4% accuracy improvement and 4.2 times faster adaptation speed on ImageNet-C.
LGNov 26, 2024
Spatio-temporal Causal Learning for Streamflow ForecastingShu Wan, Reepal Shah, Qi Deng et al.
Streamflow plays an essential role in the sustainable planning and management of national water resources. Traditional hydrologic modeling approaches simulate streamflow by establishing connections across multiple physical processes, such as rainfall and runoff. These data, inherently connected both spatially and temporally, possess intrinsic causal relations that can be leveraged for robust and accurate forecasting. Recently, spatio-temporal graph neural networks (STGNNs) have been adopted, excelling in various domains, such as urban traffic management, weather forecasting, and pandemic control, and they also promise advances in streamflow management. However, learning causal relationships directly from vast observational data is theoretically and computationally challenging. In this study, we employ a river flow graph as prior knowledge to facilitate the learning of the causal structure and then use the learned causal graph to predict streamflow at targeted sites. The proposed model, Causal Streamflow Forecasting (CSF) is tested in a real-world study in the Brazos River basin in Texas. Our results demonstrate that our method outperforms regular spatio-temporal graph neural networks and achieves higher computational efficiency compared to traditional simulation methods. By effectively integrating river flow graphs with STGNNs, this research offers a novel approach to streamflow prediction, showcasing the potential of combining advanced neural network techniques with domain-specific knowledge for enhanced performance in hydrologic modeling.
LGFeb 15
A Penalty Approach for Differentiation Through Black-Box Quadratic Programming SolversYuxuan Linghu, Zhiyuan Liu, Qi Deng
Differentiating through the solution of a quadratic program (QP) is a central problem in differentiable optimization. Most existing approaches differentiate through the Karush--Kuhn--Tucker (KKT) system, but their computational cost and numerical robustness can degrade at scale. To address these limitations, we propose dXPP, a penalty-based differentiation framework that decouples QP solving from differentiation. In the solving step (forward pass), dXPP is solver-agnostic and can leverage any black-box QP solver. In the differentiation step (backward pass), we map the solution to a smooth approximate penalty problem and implicitly differentiate through it, requiring only the solution of a much smaller linear system in the primal variables. This approach bypasses the difficulties inherent in explicit KKT differentiation and significantly improves computational efficiency and robustness. We evaluate dXPP on various tasks, including randomly generated QPs, large-scale sparse projection problems, and a real-world multi-period portfolio optimization task. Empirical results demonstrate that dXPP is competitive with KKT-based differentiation methods and achieves substantial speedups on large-scale problems.
CVOct 30, 2024
First Place Solution to the ECCV 2024 ROAD++ Challenge @ ROAD++ Spatiotemporal Agent Detection 2024Tengfei Zhang, Heng Zhang, Ruyang Li et al.
This report presents our team's solutions for the Track 1 of the 2024 ECCV ROAD++ Challenge. The task of Track 1 is spatiotemporal agent detection, which aims to construct an "agent tube" for road agents in consecutive video frames. Our solutions focus on the challenges in this task, including extreme-size objects, low-light scenarios, class imbalance, and fine-grained classification. Firstly, the extreme-size object detection heads are introduced to improve the detection performance of large and small objects. Secondly, we design a dual-stream detection model with a low-light enhancement stream to improve the performance of spatiotemporal agent detection in low-light scenes, and the feature fusion module to integrate features from different branches. Subsequently, we develop a multi-branch detection framework to mitigate the issues of class imbalance and fine-grained classification, and we design a pre-training and fine-tuning approach to optimize the above multi-branch framework. Besides, we employ some common data augmentation techniques, and improve the loss function and upsampling operation. We rank first in the test set of Track 1 for the ROAD++ Challenge 2024, and achieve 30.82% average video-mAP.
OCJan 25, 2024
Stochastic Weakly Convex Optimization Beyond Lipschitz ContinuityWenzhi Gao, Qi Deng
This paper considers stochastic weakly convex optimization without the standard Lipschitz continuity assumption. Based on new adaptive regularization (stepsize) strategies, we show that a wide class of stochastic algorithms, including the stochastic subgradient method, preserve the $\mathcal{O} ( 1 / \sqrt{K})$ convergence rate with constant failure rate. Our analyses rest on rather weak assumptions: the Lipschitz parameter can be either bounded by a general growth function of $\|x\|$ or locally estimated through independent random samples.
OCJun 6, 2021
Minibatch and Momentum Model-based Methods for Stochastic Weakly Convex OptimizationQi Deng, Wenzhi Gao
Stochastic model-based methods have received increasing attention lately due to their appealing robustness to the stepsize selection and provable efficiency guarantee. We make two important extensions for improving model-based methods on stochastic weakly convex optimization. First, we propose new minibatch model-based methods by involving a set of samples to approximate the model function in each iteration. For the first time, we show that stochastic algorithms achieve linear speedup over the batch size even for non-smooth and non-convex (particularly, weakly convex) problems. To this end, we develop a novel sensitivity analysis of the proximal mapping involved in each algorithm iteration. Our analysis appears to be of independent interests in more general settings. Second, motivated by the success of momentum stochastic gradient descent, we propose a new stochastic extrapolated model-based method, greatly extending the classic Polyak momentum technique to a wider class of stochastic algorithms for weakly convex optimization. The rate of convergence to some natural stationarity condition is established over a fairly flexible range of extrapolation terms. While mainly focusing on weakly convex optimization, we also extend our work to convex optimization. We apply the minibatch and extrapolated model-based methods to stochastic convex optimization, for which we provide a new complexity bound and promising linear speedup in batch size. Moreover, an accelerated model-based method based on Nesterov's momentum is presented, for which we establish an optimal complexity bound for reaching optimality.
OCOct 23, 2020
A Feasible Level Proximal Point Method for Nonconvex Sparse Constrained OptimizationDigvijay Boob, Qi Deng, Guanghui Lan et al.
Nonconvex sparse models have received significant attention in high-dimensional machine learning. In this paper, we study a new model consisting of a general convex or nonconvex objectives and a variety of continuous nonconvex sparsity-inducing constraints. For this constrained model, we propose a novel proximal point algorithm that solves a sequence of convex subproblems with gradually relaxed constraint levels. Each subproblem, having a proximal point objective and a convex surrogate constraint, can be efficiently solved based on a fast routine for projection onto the surrogate constraint. We establish the asymptotic convergence of the proposed algorithm to the Karush-Kuhn-Tucker (KKT) solutions. We also establish new convergence complexities to achieve an approximate KKT solution when the objective can be smooth/nonsmooth, deterministic/stochastic and convex/nonconvex with complexity that is on a par with gradient descent for unconstrained optimization problems in respective cases. To the best of our knowledge, this is the first study of the first-order methods with complexity guarantee for nonconvex sparse-constrained problems. We perform numerical experiments to demonstrate the effectiveness of our new model and efficiency of the proposed algorithm for large scale problems.
AISep 26, 2019
Artificial Intelligence BlockCloud (AIBC) Technical WhitepaperQi Deng
The AIBC is an Artificial Intelligence and blockchain technology based large-scale decentralized ecosystem that allows system-wide low-cost sharing of computing and storage resources. The AIBC consists of four layers: a fundamental layer, a resource layer, an application layer, and an ecosystem layer. The AIBC implements a two-consensus scheme to enforce upper-layer economic policies and achieve fundamental layer performance and robustness: the DPoEV incentive consensus on the application and resource layers, and the DABFT distributed consensus on the fundamental layer. The DABFT uses deep learning techniques to predict and select the most suitable BFT algorithm in order to achieve the best balance of performance, robustness, and security. The DPoEV uses the knowledge map algorithm to accurately assess the economic value of digital assets.
OCSep 3, 2019
Efficiency of Coordinate Descent Methods For Structured Nonconvex OptimizationQi Deng, Chenghao Lan
Novel coordinate descent (CD) methods are proposed for minimizing nonconvex functions consisting of three terms: (i) a continuously differentiable term, (ii) a simple convex term, and (iii) a concave and continuous term. First, by extending randomized CD to nonsmooth nonconvex settings, we develop a coordinate subgradient method that randomly updates block-coordinate variables by using block composite subgradient mapping. This method converges asymptotically to critical points with proven sublinear convergence rate for certain optimality measures. Second, we develop a randomly permuted CD method with two alternating steps: linearizing the concave part and cycling through variables. We prove asymptotic convergence to critical points and sublinear complexity rate for objectives with both smooth and concave parts. Third, we extend accelerated coordinate descent (ACD) to nonsmooth and nonconvex optimization to develop a novel randomized proximal DC algorithm whereby we solve the subproblem inexactly by ACD. Convergence is guaranteed with at most a few number of ACD iterations for each DC subproblem, and convergence complexity is established for identification of some approximate critical points. Fourth, we further develop the third method to minimize certain ill-conditioned nonconvex functions: weakly convex functions with high Lipschitz constant to negative curvature ratios. We show that, under specific criteria, the ACD-based randomized method has superior complexity compared to conventional gradient methods. Finally, an empirical study on sparsity-inducing learning models demonstrates that CD methods are superior to gradient-based methods for certain large-scale problems.
OCAug 7, 2019
Stochastic First-order Methods for Convex and Nonconvex Functional Constrained OptimizationDigvijay Boob, Qi Deng, Guanghui Lan
Functional constrained optimization is becoming more and more important in machine learning and operations research. Such problems have potential applications in risk-averse machine learning, semisupervised learning, and robust optimization among others. In this paper, we first present a novel Constraint Extrapolation (ConEx) method for solving convex functional constrained problems, which utilizes linear approximations of the constraint functions to define the extrapolation (or acceleration) step. We show that this method is a unified algorithm that achieves the best-known rate of convergence for solving different functional constrained convex composite problems, including convex or strongly convex, and smooth or nonsmooth problems with a stochastic objective and/or stochastic constraints. Many of these rates of convergence were in fact obtained for the first time in the literature. In addition, ConEx is a single-loop algorithm that does not involve any penalty subproblems. Contrary to existing primal-dual methods, it does not require the projection of Lagrangian multipliers into a (possibly unknown) bounded set. Second, for nonconvex functional constrained problems, we introduce a new proximal point method that transforms the initial nonconvex problem into a sequence of convex problems by adding quadratic terms to both the objective and constraints. Under a certain MFCQ-type assumption, we establish the convergence and rate of convergence of this method to KKT points when the convex subproblems are solved exactly or inexactly. For large-scale and stochastic problems, we present a more practical proximal point method in which the approximate solutions of the subproblems are computed by the aforementioned ConEx method. To the best of our knowledge, most of these convergence and complexity results of the proximal point method for nonconvex problems also seem to be new in the literature.
MLOct 1, 2018
Optimal Adaptive and Accelerated Stochastic Gradient DescentQi Deng, Yi Cheng, Guanghui Lan
Stochastic gradient descent (\textsc{Sgd}) methods are the most powerful optimization tools in training machine learning and deep learning models. Moreover, acceleration (a.k.a. momentum) methods and diagonal scaling (a.k.a. adaptive gradient) methods are the two main techniques to improve the slow convergence of \textsc{Sgd}. While empirical studies have demonstrated potential advantages of combining these two techniques, it remains unknown whether these methods can achieve the optimal rate of convergence for stochastic optimization. In this paper, we present a new class of adaptive and accelerated stochastic gradient descent methods and show that they exhibit the optimal sampling and iteration complexity for stochastic optimization. More specifically, we show that diagonal scaling, initially designed to improve vanilla stochastic gradient, can be incorporated into accelerated stochastic gradient descent to achieve the optimal rate of convergence for smooth stochastic optimization. We also show that momentum, apart from being known to speed up the convergence rate of deterministic optimization, also provides us new ways of designing non-uniform and aggressive moving average schemes in stochastic optimization. Finally, we present some heuristics that help to implement adaptive accelerated stochastic gradient descent methods and to further improve their practical performance for machine learning and deep learning.