STOPS: Short-Term-based Volatility-controlled Policy Search and its Global Convergence
This paper addresses safe policy optimization in robotics and is aimed at researchers and practitioners, though the contribution appears incremental because it builds on existing actor-critic and policy gradient methods.
The paper tackles the challenge of deploying risk-averse reinforcement learning by proposing STOPS, which learns from short-term trajectories to avoid hazardous states and finds a globally optimal policy at a sublinear rate comparable to the state-of-the-art convergence rate of risk-neutral policy-search methods. The algorithm is evaluated on MuJoCo simulations.
It remains challenging to deploy existing risk-averse approaches in real-world applications. The reasons are manifold, including the lack of a global optimality guarantee and the need to learn from long-term consecutive trajectories. Long-term consecutive trajectories are prone to visiting hazardous states, which is a major concern in the risk-averse setting. This paper proposes Short-Term VOlatility-controlled Policy Search (STOPS), a novel algorithm that solves risk-averse problems by learning from short-term trajectories instead of long-term trajectories. Short-term trajectories are more flexible to generate and can avoid the danger of hazardous state visitations. Using an actor-critic scheme with an overparameterized two-layer neural network, our algorithm finds a globally optimal policy at a sublinear rate with proximal policy optimization and natural policy gradient, a rate comparable to the state-of-the-art convergence rate of risk-neutral policy-search methods. The algorithm is evaluated on challenging MuJoCo robot simulation tasks under the mean-variance evaluation metric. Both the theoretical analysis and the experimental results show that STOPS performs at a state-of-the-art level among existing risk-averse policy search methods.
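To make the mean-variance evaluation metric mentioned above concrete, the following is a minimal sketch that assumes the common form of the criterion: the mean of the episode returns minus a weighted variance. The function name `mean_variance_score`, the `risk_weight` parameter, and the sample returns are hypothetical illustrations and are not taken from the paper, which may define the metric differently (for example, on per-step rewards).

```python
import numpy as np


def mean_variance_score(returns, risk_weight=1.0):
    """Return mean(returns) - risk_weight * var(returns).

    Assumption: `returns` holds one scalar return per evaluation episode,
    and `risk_weight` is a hypothetical trade-off coefficient (>= 0).
    The population variance (NumPy's default, ddof=0) is used.
    """
    returns = np.asarray(returns, dtype=float)
    return returns.mean() - risk_weight * returns.var()


# Two made-up policies with the same average return but different spread.
steady = [10.0, 10.0, 10.0, 10.0]
erratic = [0.0, 20.0, 0.0, 20.0]

steady_score = mean_variance_score(steady)
erratic_score = mean_variance_score(erratic)
```

Under this assumed form, a risk-neutral comparison (`risk_weight=0`) would rank the two policies equally, while any positive `risk_weight` favors the steadier one.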