Scaling Online Distributionally Robust Reinforcement Learning: Sample-Efficient Guarantees with General Function Approximation
This work addresses the challenge of deploying RL in real-world applications with environment mismatches, offering a scalable solution for high-dimensional tasks, though it builds incrementally on existing DR-RL methods.
The paper addresses the performance degradation that reinforcement learning agents suffer when training and deployment environments differ. It proposes an online distributionally robust RL algorithm that learns an optimal robust policy through environment interaction alone, without prior models or offline data, and achieves a near-optimal sublinear regret bound.
The deployment of reinforcement learning (RL) agents in real-world applications is often hindered by performance degradation caused by mismatches between training and deployment environments. Distributionally robust RL (DR-RL) addresses this issue by optimizing worst-case performance over an uncertainty set of transition dynamics. However, existing work typically relies on substantial prior knowledge, such as access to a generative model or a large offline dataset, and largely focuses on tabular methods that do not scale to complex domains. We overcome these limitations by proposing an online DR-RL algorithm with general function approximation that learns an optimal robust policy purely through interaction with the environment, without requiring prior models or offline data, enabling deployment in high-dimensional tasks. We further provide a theoretical analysis establishing a near-optimal sublinear regret bound under a total variation uncertainty set, demonstrating the sample efficiency and effectiveness of our method.
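To make the worst-case objective concrete, the following is a minimal tabular sketch of a robust Bellman backup under a total variation uncertainty set. It is an illustration only and not the paper's algorithm, which uses general function approximation and online exploration. The function names, the finite next-state list, the convention TV(q, p) = 0.5 * ||q - p||_1, and the default discount are all assumptions introduced here. For a finite next-state distribution, the minimizing distribution in the ball of radius sigma is obtained by shifting up to sigma probability mass from the highest-value next states onto the lowest-value one.

```python
def worst_case_expectation(p, v, sigma):
    """Minimum of sum(q[i] * v[i]) over distributions q with
    0.5 * ||q - p||_1 <= sigma (assumed total variation convention)."""
    q = list(p)
    # Index of the lowest-value next state, which receives the shifted mass.
    lowest = min(range(len(v)), key=lambda i: v[i])
    budget = sigma
    # Visit states from highest to lowest value, moving mass while budget remains.
    for i in sorted(range(len(v)), key=lambda i: v[i], reverse=True):
        if i == lowest or budget <= 0:
            continue
        moved = min(q[i], budget)
        q[i] -= moved
        q[lowest] += moved
        budget -= moved
    return sum(qi * vi for qi, vi in zip(q, v))


def robust_backup(reward, p_next, v_next, sigma, gamma=1.0):
    """One robust Bellman backup for a single (state, action) pair.
    gamma=1.0 is an illustrative default, not a setting taken from the paper."""
    return reward + gamma * worst_case_expectation(p_next, v_next, sigma)


# Hypothetical nominal transition probabilities and next-state values.
p_nominal = [0.5, 0.3, 0.2]
values = [1.0, 2.0, 4.0]
nominal_value = worst_case_expectation(p_nominal, values, 0.0)
robust_value = worst_case_expectation(p_nominal, values, 0.25)
```

With sigma = 0 the backup reduces to the usual expectation under the nominal model. As sigma grows, the robust value decreases toward the minimum next-state value, which is the pessimism that DR-RL trades for robustness to environment mismatch.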