h-index: 9
Papers: 4
Citations: 10
Novelty: 60%
AI Score: 42

4 Papers

LG · Apr 18, 2023
Feasible Policy Iteration for Safe Reinforcement Learning

Yujie Yang, Zhilong Zheng, Shengbo Eben Li et al.

Safety is the priority concern when applying reinforcement learning (RL) algorithms to real-world control problems. While policy iteration provides a fundamental algorithm for standard RL, an analogous theoretical algorithm for safe RL remains absent. In this paper, we propose feasible policy iteration (FPI), the first foundational dynamic programming algorithm for safe RL. FPI alternates between policy evaluation, region identification, and policy improvement. This follows the actor-critic-scenery (ACS) framework, where scenery refers to a feasibility function that represents a feasible region. A region-wise update rule is developed for the policy improvement step, which maximizes the state-value function inside the feasible region and minimizes the feasibility function outside it. With this update rule, FPI guarantees monotonic expansion of the feasible region, monotonic improvement of the state-value function, and geometric convergence to the optimal safe policy. Experimental results demonstrate that FPI achieves strictly zero constraint violation on low-dimensional tasks and outperforms existing methods in constraint adherence and reward performance on high-dimensional tasks.
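The region-wise update rule described in this abstract can be illustrated with a minimal tabular sketch. This is not the authors' implementation: the names `q_value`, `q_feas`, and `feasible`, the convention that a feasibility value at or below 0 means "safe", and the restriction to safe actions inside the region are all assumptions made for illustration.

```python
def region_wise_improvement(q_value, q_feas, feasible):
    """One greedy policy-improvement pass with a region-wise rule.

    q_value[s][a]: estimated return of action a in state s (hypothetical table).
    q_feas[s][a]:  estimated feasibility value of action a in state s;
                   lower is safer and <= 0 counts as safe (assumed convention).
    feasible[s]:   True if state s lies in the current feasible region.
    Returns a list giving the chosen action index for every state.
    """
    policy = []
    for s in range(len(q_value)):
        actions = list(range(len(q_value[s])))
        if feasible[s]:
            # Inside the region: maximize value among actions that stay safe
            # (assumption); fall back to all actions if none qualifies.
            safe = [a for a in actions if q_feas[s][a] <= 0.0]
            candidates = safe if safe else actions
            best = max(candidates, key=lambda a: q_value[s][a])
        else:
            # Outside the region: ignore reward and minimize the feasibility value.
            best = min(actions, key=lambda a: q_feas[s][a])
        policy.append(best)
    return policy


q_value = [[1.0, 5.0, 3.0], [2.0, 9.0, 0.0]]
q_feas = [[0.0, 1.0, 0.0], [0.4, 0.9, 0.2]]
feasible = [True, False]
policy = region_wise_improvement(q_value, q_feas, feasible)
```

In state 0 the highest-value action is unsafe under the assumed threshold, so the sketch picks the best safe action instead; in state 1, which is outside the region, it picks the action with the lowest feasibility value regardless of reward.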

SY · Jan 29
The Feasibility Theory of Constrained Reinforcement Learning: A Tutorial Study

Yujie Yang, Zhilong Zheng, Masayoshi Tomizuka et al.

Satisfying safety constraints is a priority concern when solving optimal control problems (OCPs). Because of the infeasibility phenomenon, in which a constraint-satisfying solution cannot be found, it is necessary to identify a feasible region before implementing a policy. Existing feasibility theories built for model predictive control (MPC) only consider the feasibility of the optimal policy. However, reinforcement learning (RL), as another important control method, solves for the optimal policy iteratively, which produces a series of non-optimal intermediate policies. Feasibility analysis of these non-optimal policies is also necessary for iteratively improving constraint satisfaction, but it is not available under existing MPC feasibility theories. This paper proposes a feasibility theory that applies to both MPC and RL by filling in the missing part of feasibility analysis for an arbitrary policy. The basis of our theory is to decouple policy solving and implementation into two temporal domains: the virtual-time domain and the real-time domain. This allows us to separately define initial and endless, state and policy feasibility, and their corresponding feasible regions. Based on these definitions, we analyze the containment relationships between different feasible regions, which enables us to describe the feasible region of an arbitrary policy. We further provide virtual-time constraint design rules along with a practical design tool called the feasibility function that helps to achieve the maximum feasible region. We review most existing constraint formulations and point out that they are essentially applications of feasibility functions in different forms. We demonstrate our feasibility theory by visualizing different feasible regions under both MPC and RL policies in an emergency braking control task.
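To make the gap between constraint satisfaction and feasibility concrete, here is a small sketch for a simplified emergency braking setting. The model is an assumption and is not taken from the paper: a point-mass vehicle with a constant maximum deceleration and a stationary obstacle, with hypothetical parameter names. Under that model, a state admits a collision-free future only if the stopping distance v²/(2a) fits within the remaining gap.

```python
def stopping_distance(speed_mps, max_decel_mps2):
    """Distance needed to stop under constant maximum braking (point-mass model)."""
    if max_decel_mps2 <= 0.0:
        raise ValueError("max_decel_mps2 must be positive")
    return speed_mps ** 2 / (2.0 * max_decel_mps2)


def satisfies_constraint(gap_m):
    """Instantaneous safety constraint: the vehicle has not hit the obstacle."""
    return gap_m > 0.0


def in_feasible_region(gap_m, speed_mps, max_decel_mps2):
    """True if some policy (full braking) keeps the constraint satisfied forever."""
    return gap_m >= stopping_distance(speed_mps, max_decel_mps2)


# Both states satisfy the constraint right now, but only the first is feasible.
state_a = (30.0, 20.0)  # 30 m gap at 20 m/s
state_b = (20.0, 20.0)  # 20 m gap at 20 m/s
max_decel = 8.0         # m/s^2, hypothetical braking limit

feasible_a = in_feasible_region(state_a[0], state_a[1], max_decel)
feasible_b = in_feasible_region(state_b[0], state_b[1], max_decel)
```

The second state currently satisfies the constraint, yet no braking policy can avoid a later violation, which is the kind of state a feasible region is meant to exclude.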

CL · Feb 17
STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens

Shiqi Liu, Zeyu He, Guojian Zhan et al.

Reinforcement learning (RL) has significantly improved large language model reasoning, but existing RL fine-tuning methods rely heavily on heuristic techniques such as entropy regularization and reweighting to maintain stability. In practice, they often suffer from late-stage performance collapse, leading to degraded reasoning quality and unstable training. Our analysis shows that the magnitude of token-wise policy gradients in RL is negatively correlated with token probability and local policy entropy. We find that training instability can be caused by a tiny fraction of tokens, approximately 0.01%, which we term spurious tokens. When such tokens appear in correct responses, they contribute little to the reasoning outcome but inherit the full sequence-level reward, leading to abnormally amplified gradient updates. To mitigate this instability, we design the S2T (silencing spurious tokens) mechanism to efficiently identify spurious tokens through three characteristic signals (low probability, low entropy, and positive advantage) and then suppress their gradient perturbations during optimization. Incorporating this mechanism into a group-based objective, we propose Spurious-Token-Aware Policy Optimization (STAPO), which promotes stable and effective large-scale model refinement. Across six mathematical reasoning benchmarks using Qwen 1.7B, 8B, and 14B base models, STAPO consistently demonstrates superior entropy stability and achieves an average performance improvement of 7.13% (ρ_T = 1.0, top-p = 1.0) and 3.69% (ρ_T = 0.7, top-p = 0.9) over GRPO, 20-Entropy, and JustRL.
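The identification step described in this abstract (low probability, low entropy, positive advantage) can be sketched as a per-token mask applied to a policy-gradient loss. This is an illustrative sketch, not the STAPO implementation: the threshold values `prob_max` and `entropy_max`, the function names, and the simple masked-mean loss are all assumptions.

```python
import math


def spurious_mask(token_probs, entropies, advantages,
                  prob_max=0.01, entropy_max=0.5):
    """Return 1.0 for tokens to keep and 0.0 for tokens to silence.

    A token is flagged only when all three signals hold: its sampled
    probability is low, the local policy entropy is low, and its advantage
    is positive. The thresholds are hypothetical.
    """
    mask = []
    for p, h, adv in zip(token_probs, entropies, advantages):
        is_spurious = p < prob_max and h < entropy_max and adv > 0.0
        mask.append(0.0 if is_spurious else 1.0)
    return mask


def masked_pg_loss(token_probs, advantages, mask):
    """Mean of -advantage * log(prob) over the tokens that are kept."""
    kept = sum(mask)
    if kept == 0:
        return 0.0
    total = sum(-adv * math.log(p) * m
                for p, adv, m in zip(token_probs, advantages, mask))
    return total / kept


token_probs = [0.9, 0.005, 0.005, 0.3]
entropies = [0.2, 0.1, 2.0, 1.1]
advantages = [1.0, 1.0, 1.0, -1.0]

mask = spurious_mask(token_probs, entropies, advantages)
loss = masked_pg_loss(token_probs, advantages, mask)
```

Only the second token meets all three conditions and is silenced. The third token is also rare, but it was sampled where the policy was uncertain (high entropy), so the sketch keeps it.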

DC · Dec 17, 2024
TrainMover: An Interruption-Resilient and Reliable ML Training Runtime

ChonLam Lao, Minlan Yu, Aditya Akella et al.

Large-scale ML training jobs are frequently interrupted by hardware and software anomalies, failures, and management events. Existing solutions like checkpointing or runtime reconfiguration suffer from long downtimes, degraded performance, or undesired changes to training strategies. We present TrainMover, a resilient runtime that leverages standby machines to handle interruptions with minimal downtime and zero memory overhead. To achieve these goals, TrainMover introduces two key techniques: two-phase, delta-based communication group setups and communication-free sandboxed shadow iterations. Our evaluation shows that TrainMover consistently achieves second-level downtime across all evaluated models during migration, maintaining 99% training efficiency during periodic 10-minute rebalancing. We also demonstrate the effectiveness of TrainMover in handling various interruptions.