OCAug 15, 2023
Projection-Free Methods for Stochastic Simple Bilevel Optimization with Convex Lower-level ProblemJincheng Cao, Ruichen Jiang, Nazanin Abolfazli et al.
In this paper, we study a class of stochastic bilevel optimization problems, also known as stochastic simple bilevel optimization, where we minimize a smooth stochastic objective function over the optimal solution set of another stochastic convex optimization problem. We introduce novel stochastic bilevel optimization methods that locally approximate the solution set of the lower-level problem via a stochastic cutting plane, and then run a conditional gradient update with variance reduction techniques to control the error induced by using stochastic gradients. For the case that the upper-level function is convex, our method requires $\tilde{\mathcal{O}}(\max\{1/ε_f^{2},1/ε_g^{2}\}) $ stochastic oracle queries to obtain a solution that is $ε_f$-optimal for the upper-level and $ε_g$-optimal for the lower-level. This guarantee improves the previous best-known complexity of $\mathcal{O}(\max\{1/ε_f^{4},1/ε_g^{4}\})$. Moreover, for the case that the upper-level function is non-convex, our method requires at most $\tilde{\mathcal{O}}(\max\{1/ε_f^{3},1/ε_g^{3}\}) $ stochastic oracle queries to find an $(ε_f, ε_g)$-stationary point. In the finite-sum setting, we show that the number of stochastic oracle calls required by our method are $\tilde{\mathcal{O}}(\sqrt{n}/ε)$ and $\tilde{\mathcal{O}}(\sqrt{n}/ε^{2})$ for the convex and non-convex settings, respectively, where $ε=\min \{ε_f,ε_g\}$.
66.6LGMay 11
Curriculum Learning-Guided Progressive Distillation in Large Language ModelsJincheng Cao, Fanzhi Zeng, Leqi Liu et al.
Knowledge distillation is a key technique for transferring the capabilities of large language models (LLMs) into smaller, more efficient student models. Existing distillation approaches often overlook two critical factors: the learning order of training data and the capacity mismatch between teacher and student models. This oversight limits distillation performance, as manifested by the counter-intuitive phenomenon where stronger teachers fail to produce better students. In this work, we propose Curriculum Learning-Guided Progressive Distillation (CLPD), a unified framework that explicitly accounts for both factors by aligning data difficulty with teacher strength. CLPD constructs an explicit curriculum by organizing training examples from easy to hard, while simultaneously applying an implicit curriculum over supervision signals by progressively scheduling teachers of increasing capacity. Our framework is modular and can be integrated into standard distillation algorithms with minimal overhead. Empirical results on the reasoning benchmarks demonstrate that CLPD consistently outperforms standard distillation, data ordering alone, and teacher scheduling alone across multiple settings. These findings highlight the importance of jointly considering data ordering and teacher capacity when distilling reasoning abilities into small language models.
LGJan 22
CARE-RFT: Confidence-Anchored Reinforcement Finetuning for Reliable Reasoning in Large Language ModelsShuozhe Li, Jincheng Cao, Bodun Hu et al.
Reinforcement finetuning (RFT) has emerged as a powerful paradigm for unlocking reasoning capabilities in large language models. However, we identify a critical trade-off: while unconstrained RFT achieves strong reasoning performance, it severely compromises model trustworthiness by amplifying hallucination and worsening calibration; conversely, RKL-constrained RFT preserves trustworthiness but limits reasoning gains due to its unbounded penalty on exploratory deviations. To resolve this tension, we introduce CARE-RFT (Confidence-Anchored Regularized Reinforcement Finetuning), a novel method that replaces standard reverse KL regularization with a skew reverse KL divergence. CARE-RFT provides a confidence-sensitive penalty: it is bounded for confident, consistently rewarded explorations to enable reasoning, while unbounded elsewhere to preserve calibration. Extensive experiments across multiple model scales and RFT algorithms show that CARE-RFT achieves a superior balance, matching the reasoning performance of unconstrained RFT while recovering the trustworthiness and calibration of the base model. Our work establishes that careful, confidence-aware regularization is key to building both capable and trustworthy reasoning models.
OCFeb 12, 2024
An Accelerated Gradient Method for Convex Smooth Simple Bilevel OptimizationJincheng Cao, Ruichen Jiang, Erfan Yazdandoost Hamedani et al.
In this paper, we focus on simple bilevel optimization problems, where we minimize a convex smooth objective function over the optimal solution set of another convex smooth constrained optimization problem. We present a novel bilevel optimization method that locally approximates the solution set of the lower-level problem using a cutting plane approach and employs an accelerated gradient-based update to reduce the upper-level objective function over the approximated solution set. We measure the performance of our method in terms of suboptimality and infeasibility errors and provide non-asymptotic convergence guarantees for both error criteria. Specifically, when the feasible set is compact, we show that our method requires at most $\mathcal{O}(\max\{1/\sqrt{ε_{f}}, 1/ε_g\})$ iterations to find a solution that is $ε_f$-suboptimal and $ε_g$-infeasible. Moreover, under the additional assumption that the lower-level objective satisfies the $r$-th Hölderian error bound, we show that our method achieves an iteration complexity of $\mathcal{O}(\max\{ε_{f}^{-\frac{2r-1}{2r}},ε_{g}^{-\frac{2r-1}{2r}}\})$, which matches the optimal complexity of single-level convex constrained optimization when $r=1$.
OCJul 30, 2025
On the Complexity of Finding Stationary Points in Nonconvex Simple Bilevel OptimizationJincheng Cao, Ruichen Jiang, Erfan Yazdandoost Hamedani et al.
In this paper, we study the problem of solving a simple bilevel optimization problem, where the upper-level objective is minimized over the solution set of the lower-level problem. We focus on the general setting in which both the upper- and lower-level objectives are smooth but potentially nonconvex. Due to the absence of additional structural assumptions for the lower-level objective-such as convexity or the Polyak-Łojasiewicz (PL) condition-guaranteeing global optimality is generally intractable. Instead, we introduce a suitable notion of stationarity for this class of problems and aim to design a first-order algorithm that finds such stationary points in polynomial time. Intuitively, stationarity in this setting means the upper-level objective cannot be substantially improved locally without causing a larger deterioration in the lower-level objective. To this end, we show that a simple and implementable variant of the dynamic barrier gradient descent (DBGD) framework can effectively solve the considered nonconvex simple bilevel problems up to stationarity. Specifically, to reach an $(ε_f, ε_g)$-stationary point-where $ε_f$ and $ε_g$ denote the target stationarity accuracies for the upper- and lower-level objectives, respectively-the considered method achieves a complexity of $\mathcal{O}\left(\max\left(ε_f^{-\frac{3+p}{1+p}}, ε_g^{-\frac{3+p}{2}}\right)\right)$, where $p \geq 0$ is an arbitrary constant balancing the terms. To the best of our knowledge, this is the first complexity result for a discrete-time algorithm that guarantees joint stationarity for both levels in general nonconvex simple bilevel problems.