LG, AI · Mar 16

Safe Flow Q-Learning: Offline Safe Reinforcement Learning with Reachability-Based Flow Policies

Mumuksh Tayal, Manan Tayal, Ravi Prakash

arXiv:2603.15136 · 20.8 · h-index: 9

AI Analysis

This work addresses safety-critical real-time control problems in robotics and autonomous systems, offering an incremental improvement over existing methods.

The paper tackles offline safe reinforcement learning by proposing Safe Flow Q-Learning, which combines reachability-based safety values with flow policies. On boat navigation and Safety Gymnasium tasks, it reduces constraint violations while achieving competitive performance and lower inference latency than diffusion-style baselines.

Offline safe reinforcement learning (RL) seeks reward-maximizing policies from static datasets under strict safety constraints. Existing methods often rely on soft expected-cost objectives or iterative generative inference, which can be insufficient for safety-critical real-time control. We propose Safe Flow Q-Learning (SafeFQL), which extends Flow Q-Learning (FQL) to safe offline RL by combining a Hamilton-Jacobi reachability-inspired safety value function with an efficient one-step flow policy. SafeFQL learns the safety value via a self-consistency Bellman recursion, trains a flow policy by behavioral cloning, and distills it into a one-step actor for reward-maximizing safe action selection without rejection sampling at deployment. To account for finite-data approximation error in the learned safety boundary, we add a conformal prediction calibration step that adjusts the safety threshold and provides finite-sample probabilistic safety coverage. Empirically, SafeFQL trades modestly higher offline training cost for substantially lower inference latency than diffusion-style safe generative baselines, which is advantageous for real-time safety-critical deployment. Across boat navigation and Safety Gymnasium MuJoCo tasks, SafeFQL matches or exceeds prior offline safe RL performance while substantially reducing constraint violations.
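
The conformal calibration step mentioned in the abstract can be illustrated with a generic split-conformal sketch. This is not the paper's procedure, whose details the abstract does not give. It assumes a sign convention in which larger learned safety values mean "safer", and a held-out calibration set of states known to lead to a violation. The function name `conformal_safety_threshold` and its arguments are hypothetical.

```python
import math
import numpy as np


def conformal_safety_threshold(unsafe_values, alpha):
    """Split-conformal threshold on learned safety values (illustrative only).

    unsafe_values: learned safety values at held-out calibration states that
        are known to be unsafe (assumed convention: larger value = safer).
    alpha: tolerated probability that a new unsafe state scores above the
        returned threshold.

    Returns tau such that, if calibration and test states are exchangeable,
    a new unsafe state has value <= tau with probability >= 1 - alpha.
    """
    scores = np.sort(np.asarray(unsafe_values, dtype=float))
    n = scores.size
    if n == 0:
        raise ValueError("need at least one calibration value")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")

    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: no finite threshold
        # can carry the guarantee, so every state is treated as unsafe.
        return math.inf
    return float(scores[rank - 1])


# Hypothetical calibration values for seven unsafe states.
calib = [-0.9, -0.5, -0.3, -0.1, 0.05, 0.2, 0.3]
tau = conformal_safety_threshold(calib, alpha=0.25)
```

Under these assumptions, a deployed policy would treat an action as safe only when its learned safety value exceeds `tau` rather than the nominal boundary at zero. The margin between the two absorbs approximation error in the learned boundary.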
