LG IT ST ML
May 17, 2023

Reward-agnostic Fine-tuning: Provable Statistical Benefits of Hybrid Reinforcement Learning

arXiv:2305.10282v1
18 citations
Originality: Incremental advance
AI Analysis

This work addresses the challenge of sample-efficient policy fine-tuning in RL for scenarios with limited data, though it is incremental in building on existing reward-agnostic and model-based methods.

The paper tackles the problem of efficiently combining offline datasets and online interactions in tabular reinforcement learning, proposing a three-stage hybrid algorithm that achieves lower sample complexity than pure offline or online RL without requiring reward information during data collection.

This paper studies tabular reinforcement learning (RL) in the hybrid setting, which assumes access to both an offline dataset and online interactions with the unknown environment. A central question boils down to how to efficiently utilize online data collection to strengthen and complement the offline dataset and enable effective policy fine-tuning. Leveraging recent advances in reward-agnostic exploration and model-based offline RL, we design a three-stage hybrid RL algorithm that beats the best of both worlds -- pure offline RL and pure online RL -- in terms of sample complexities. The proposed algorithm does not require any reward information during data collection. Our theory is developed based on a new notion called single-policy partial concentrability, which captures the trade-off between distribution mismatch and miscoverage and guides the interplay between offline and online data.
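The abstract does not define single-policy partial concentrability, so the following is only a rough sketch of the trade-off it describes, not the paper's definition. It assumes a simplified single-step setting with two hypothetical inputs: `d_star`, the target policy's occupancy over state-action pairs, and `d_off`, the offline data distribution over the same pairs. The function name `partial_concentrability` and the tolerance `sigma` are illustrative choices. The idea shown is that allowing a fraction `sigma` of the target policy's mass to go uncovered by offline data (miscoverage) can shrink the worst-case density ratio on the remaining pairs (distribution mismatch).

```python
import math

def partial_concentrability(d_star, d_off, sigma):
    """Illustrative only: smallest worst-case ratio d_star/d_off over a kept
    set of state-action pairs, where the dropped pairs carry at most
    `sigma` of d_star's probability mass."""
    pairs = []
    for p, q in zip(d_star, d_off):
        if p == 0.0:
            ratio = 0.0          # pair never visited by the target policy
        elif q == 0.0:
            ratio = math.inf     # visited by the target, absent offline
        else:
            ratio = p / q
        pairs.append((ratio, p))

    # Drop the worst-covered pairs first; stop once the miscoverage
    # budget sigma would be exceeded.
    pairs.sort(key=lambda t: t[0], reverse=True)
    dropped_mass = 0.0
    for ratio, p in pairs:
        if dropped_mass + p <= sigma + 1e-12:
            dropped_mass += p
        else:
            return ratio         # worst ratio among the kept pairs
    return 0.0                   # everything fit within the budget

d_star = [0.5, 0.4, 0.1]
d_off = [0.5, 0.5, 0.0]
no_slack = partial_concentrability(d_star, d_off, 0.0)
some_slack = partial_concentrability(d_star, d_off, 0.1)
```

In this toy example the third pair is never seen offline, so with `sigma = 0.0` the ratio is infinite, while tolerating `sigma = 0.1` of uncovered mass brings it down to a finite value. In the hybrid setting described above, the uncovered part is what online exploration would need to fill in.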

