LG · AI · May 18

ISEP: Implicit Support Expansion for Offline Reinforcement Learning via Stochastic Policy Optimization

arXiv:2605.18320 · 38.1
AI Analysis

Offline RL practitioners benefit from a method that balances safety and exploration, enabling better performance on tasks where the behavior policy's support is limited.

ISEP introduces a stochastic policy optimization method that implicitly expands the support of feasible actions in offline RL, achieving up to 20% improvement over baselines on D4RL benchmarks while maintaining safety guarantees.

Offline reinforcement learning methods typically enforce strict constraints to ensure safety, yet this rigidity often prevents the discovery of optimal behaviors outside the immediate support of the behavior policy. To address this, we propose Implicit Support Expansion via stochastic Policy optimization (ISEP), which leverages a value function interpolated between in-distribution data and policy samples to implicitly expand the feasible action support. This mechanism "densifies" high-reward regions, creating a navigable path for policy improvement while theoretically guaranteeing bounded value error. However, optimizing against this expanded support creates a multimodal landscape in which standard deterministic averaging leads to mode collapse and invalid actions. ISEP mitigates this with a stochastic action selection strategy that optimizes the policy by stochastically alternating between conservative cloning signals and optimistic expansion signals. We instantiate this framework as ISEP-FM, using Conditional Flow Matching with classifier-free guidance to effectively capture the interpolated value signal.
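The contrast the abstract draws between deterministic averaging and stochastic action selection can be illustrated with a toy sketch. It is not the ISEP-FM training procedure, which the abstract describes only at a high level. The 1-D actions are made up, and the names `select_targets`, `average_targets`, and `p_expand` are hypothetical. The sketch assumes one conservative (cloning) target and one optimistic (expansion) target per sample, and compares blending them with picking one at random.

```python
import numpy as np

def select_targets(a_clone, a_expand, p_expand, rng):
    """Pick, per sample, either the cloning target or the expansion target.

    Hypothetical helper: p_expand is an assumed mixing probability,
    not a parameter named in the source.
    """
    use_expand = rng.random(a_clone.shape[0]) < p_expand
    # Broadcast the per-sample choice over the action dimensions.
    mask = use_expand.reshape(-1, *([1] * (a_clone.ndim - 1)))
    return np.where(mask, a_expand, a_clone), use_expand

def average_targets(a_clone, a_expand, p_expand):
    """Deterministic blend of the two targets, shown for contrast."""
    return (1.0 - p_expand) * a_clone + p_expand * a_expand

rng = np.random.default_rng(0)
n = 1000
a_clone = np.full((n, 1), -1.0)   # toy in-distribution action mode
a_expand = np.full((n, 1), 1.0)   # toy high-value mode from policy samples

stochastic, use_expand = select_targets(a_clone, a_expand, 0.5, rng)
averaged = average_targets(a_clone, a_expand, 0.5)

print(np.unique(stochastic))  # targets stay on the two original modes
print(np.unique(averaged))    # targets collapse to one point between the modes
```

In this toy setting the averaged target sits between the two modes, at an action that neither the data nor the policy samples support. The stochastic choice keeps every regression target on one of the two modes. Whether ISEP alternates per sample, per batch, or per update step is not stated in the abstract, so the per-sample choice here is an assumption.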
