LG · Feb 4

Stochastic Decision Horizons for Constrained Reinforcement Learning

Nikola Milosevic, Leonard Franz, Daniel Haeufle, Georg Martius, Nico Scherf, Pavel Kolev

arXiv:2602.04599v1 · 1.4 · h-index: 10

Originality: Incremental advance

AI Analysis

This work addresses how to handle constraints, such as safety, in reinforcement learning. It is relevant to researchers and practitioners and offers an incremental improvement over existing methods.

The paper targets the poor off-policy scalability of constrained reinforcement learning. It proposes a Control as Inference formulation with stochastic decision horizons and reports improved sample efficiency and return-violation trade-offs on standard benchmarks, along with scaling to a high-dimensional musculoskeletal setup.

Constrained Markov decision processes (CMDPs) provide a principled model for handling constraints, such as safety and other auxiliary objectives, in reinforcement learning. The common approach of using additive-cost constraints and dual variables often hinders off-policy scalability. We propose a Control as Inference formulation based on stochastic decision horizons, where constraint violations attenuate reward contributions and shorten the effective planning horizon via state-action-dependent continuation. This yields survival-weighted objectives that remain replay-compatible for off-policy actor-critic learning. We propose two violation semantics, absorbing and virtual termination, that share the same survival-weighted return but result in distinct optimization structures that lead to SAC/MPO-style policy improvement. Experiments demonstrate improved sample efficiency and favorable return-violation trade-offs on standard benchmarks. Moreover, MPO with virtual termination (VT-MPO) scales effectively to our high-dimensional musculoskeletal Hyfydy setup.
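
To make the idea of a survival-weighted return concrete, here is a minimal sketch. It is an illustration under stated assumptions, not the paper's implementation: it assumes each step has a continuation probability c_t in [0, 1] (lower after a constraint violation) and that the reward at step t is weighted by the discount and by the product of the continuation probabilities of all earlier steps. The function names and the exact weighting convention are hypothetical; the abstract does not specify them.

```python
import numpy as np


def survival_weighted_return(rewards, continuation, gamma=0.99):
    """Forward form: sum_t gamma^t * (prod_{k<t} c_k) * r_t.

    `continuation[t]` is an assumed per-step continuation probability in [0, 1];
    a value below 1 models a constraint violation shortening the horizon.
    """
    rewards = np.asarray(rewards, dtype=float)
    continuation = np.asarray(continuation, dtype=float)
    # survival[t] = probability of still being "alive" when reward t is collected
    survival = np.concatenate(([1.0], np.cumprod(continuation[:-1])))
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * survival * rewards))


def survival_weighted_return_recursive(rewards, continuation, gamma=0.99):
    """Backward form: G_t = r_t + gamma * c_t * G_{t+1}.

    This is the shape a one-step bootstrapped target would take, which is why
    such an objective can be estimated from replayed transitions.
    """
    g = 0.0
    for r, c in zip(reversed(rewards), reversed(continuation)):
        g = r + gamma * c * g
    return g


# Toy trajectory: a violation at step 1 halves the continuation probability,
# so the reward at step 2 only counts with weight 0.5.
rewards = [1.0, 1.0, 1.0]
continuation = [1.0, 0.5, 1.0]
forward = survival_weighted_return(rewards, continuation, gamma=1.0)
backward = survival_weighted_return_recursive(rewards, continuation, gamma=1.0)
```

With gamma set to 1, both forms give 2.5 for this trajectory instead of the unconstrained 3.0: the violation does not add a separate cost term, it scales down everything that comes after it. The recursive form depends only on a single transition and the next value, which matches the abstract's claim that the objective stays replay-compatible.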

