LG AI · Oct 24, 2024

SAMG: Offline-to-Online Reinforcement Learning via State-Action-Conditional Offline Model Guidance

Liyu Zhang, Haochi Wu, Xu Wan, Quan Kong, Ruilong Deng, Mingyang Sun

arXiv:2410.18626v22.6 · h-index: 4

Originality: Incremental advance

AI Analysis

This work addresses a key bottleneck for reinforcement learning applications that must adapt efficiently from offline to online data, though it is an incremental improvement over existing methods.

The paper tackles the inefficiency of offline-to-online reinforcement learning by eliminating the need to maintain offline datasets during fine-tuning. It introduces SAMG, which freezes a pre-trained offline critic and uses it to guide online fine-tuning. The method achieves state-of-the-art performance on the D4RL benchmark, with empirical results showing improved efficiency and lower estimation error.

Offline-to-online (O2O) reinforcement learning (RL) pre-trains models on offline data and refines policies through online fine-tuning. However, existing O2O RL algorithms typically require maintaining the tedious offline datasets to mitigate the effects of out-of-distribution (OOD) data, which significantly limits their efficiency in exploiting online samples. To address this deficiency, we introduce a new paradigm for O2O RL called State-Action-Conditional Offline Model Guidance (SAMG). It freezes the pre-trained offline critic to provide compact offline understanding for each state-action sample, thus eliminating the need for retraining on offline data. The frozen offline critic is incorporated with the online target critic weighted by a state-action-adaptive coefficient. This coefficient aims to capture the offline degree of samples at the state-action level, and is updated adaptively during training. In practice, SAMG could be easily integrated with Q-function-based algorithms. Theoretical analysis shows good optimality and lower estimation error. Empirically, SAMG outperforms state-of-the-art O2O RL algorithms on the D4RL benchmark.
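The weighting described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact update rule, so the convex-combination form below, the evaluation of both critics at the next state-action pair, and every name (`blended_td_target`, `offline_degree`, and so on) are assumptions made for illustration, not the authors' implementation. How the coefficient is estimated and updated is also not shown here; it is simply passed in as a number in [0, 1].

```python
def blended_td_target(reward, done, q_online_next, q_offline_next,
                      offline_degree, gamma=0.99):
    """Hypothetical TD target mixing a frozen offline critic with the
    online target critic (assumed form, not taken from the paper).

    reward          -- immediate reward for the transition
    done            -- True if the transition ends the episode
    q_online_next   -- online target critic's value at the next state-action
    q_offline_next  -- frozen offline critic's value at the next state-action
    offline_degree  -- assumed state-action coefficient in [0, 1];
                       1 means "fully covered by offline data"
    """
    # Clamp the coefficient so the mix stays a convex combination.
    p = min(max(offline_degree, 0.0), 1.0)
    # Weight the two critics' estimates by the coefficient.
    blended_next = (1.0 - p) * q_online_next + p * q_offline_next
    # Standard one-step bootstrap; no bootstrapping on terminal transitions.
    not_done = 0.0 if done else 1.0
    return reward + gamma * not_done * blended_next


# A sample judged 25% "offline-like" leans mostly on the online critic.
target = blended_td_target(reward=1.0, done=False,
                           q_online_next=10.0, q_offline_next=20.0,
                           offline_degree=0.25, gamma=0.5)
print(target)  # 7.25
```

With `offline_degree` set to 0 the function reduces to an ordinary one-step TD target, which is why a scheme of this shape can sit on top of Q-function-based algorithms without replaying the offline dataset: the offline knowledge enters only through the frozen critic's output.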
