LG | AI | May 18, 2025

Imagination-Limited Q-Learning for Offline Reinforcement Learning

arXiv:2505.12211v1 | 1 citation | h-index: 4 | IJCAI
Originality: Incremental advance
AI Analysis

This work addresses a key challenge in offline RL, namely improving policy learning from historical data alone, and represents an incremental advance over existing constraint-based methods.

The paper tackles the problem of over-optimistic value estimates for out-of-distribution actions in offline reinforcement learning by proposing Imagination-Limited Q-learning, which uses a dynamics model to imagine and clip action-values, achieving state-of-the-art performance on D4RL benchmark tasks.

Offline reinforcement learning seeks to derive improved policies entirely from historical data but often struggles with over-optimistic value estimates for out-of-distribution (OOD) actions. This issue is typically mitigated via policy constraint or conservative value regularization methods. However, these approaches may impose overly restrictive constraints or yield biased value estimates, potentially limiting performance improvements. To balance exploitation and restriction, we propose an Imagination-Limited Q-learning (ILQ) method, which aims to maintain the optimism that OOD actions deserve within appropriate limits. Specifically, we use a dynamics model to imagine OOD action-values and then clip the imagined values with the maximum behavior values. This design preserves a reasonable evaluation of OOD actions as far as possible while avoiding over-optimism. Theoretically, we prove the convergence of the proposed ILQ under tabular Markov decision processes. In particular, we demonstrate that the error bound between the estimated values and the optimal values of OOD state-actions is of the same magnitude as that of in-distribution ones, indicating that the bias in value estimates is effectively mitigated. Empirically, our method achieves state-of-the-art performance on a wide range of tasks in the D4RL benchmark.
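
The "imagine, then clip" step described in the abstract can be illustrated with a minimal tabular sketch. This is not the authors' implementation: the function name `imagined_limited_value`, the deterministic one-step model tables `reward_model` and `next_state_model`, and the use of a greedy bootstrap for the imagined value are all assumptions made for illustration. The only parts taken from the abstract are that an OOD action-value is imagined with a dynamics model and then capped by the maximum value among behavior (in-dataset) actions.

```python
import numpy as np


def imagined_limited_value(q, reward_model, next_state_model, state, action,
                           behavior_actions, gamma=0.99):
    """Clipped imagined value for one (state, action) pair in a tabular MDP.

    All arguments are hypothetical stand-ins:
      q                 -- array [n_states, n_actions] of current Q estimates
      reward_model      -- array [n_states, n_actions] of predicted rewards
      next_state_model  -- int array [n_states, n_actions] of predicted next states
      behavior_actions  -- actions observed in the dataset at `state`
    """
    # Imagine a one-step return with the (assumed deterministic) learned model.
    next_state = next_state_model[state, action]
    imagined = reward_model[state, action] + gamma * q[next_state].max()
    # Upper limit: the best Q-value among actions the dataset contains here.
    limit = q[state, behavior_actions].max()
    # Keep the imagined value when it is modest, cap it when it is too optimistic.
    return float(min(imagined, limit))


# Toy example: action 1 in state 0 is OOD; only action 0 appears in the data.
q = np.array([[1.0, 5.0], [2.0, 0.0]])
next_state_model = np.array([[1, 1], [0, 0]])
optimistic_rewards = np.array([[0.0, 10.0], [0.0, 0.0]])
pessimistic_rewards = np.array([[0.0, -3.0], [0.0, 0.0]])

capped = imagined_limited_value(q, optimistic_rewards, next_state_model,
                                state=0, action=1, behavior_actions=[0], gamma=0.5)
kept = imagined_limited_value(q, pessimistic_rewards, next_state_model,
                              state=0, action=1, behavior_actions=[0], gamma=0.5)
print(capped, kept)  # 1.0 -2.0
```

In the first call the imagined value (11.0) exceeds the best behavior value (1.0) and is clipped; in the second the imagined value (-2.0) is below the limit and is kept unchanged, which is the sense in which optimism is retained only "within appropriate limits".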

