LG AI
May 14

Proximal Action Replacement for Behavior Cloning Actor-Critic in Offline Reinforcement Learning

Jinzong Dong, Wei Huang, Jianshu Zhang, Zhuo Chen, Xinzhe Yuan, Qinying Gu, Zhaohui Jiang, Nanyang Ye

arXiv:2602.07441
53.61 citations
h-index: 9

AI Analysis

For researchers and practitioners in offline reinforcement learning, this work addresses a fundamental limitation of BC-regularized methods with a simple, plug-and-play solution that yields consistent gains.

The paper identifies a performance ceiling in behavior cloning (BC)-regularized actor-critic methods for offline RL, caused by indiscriminate imitation of suboptimal dataset actions. It proposes proximal action replacement (PAR), which substitutes suboptimal dataset actions with better ones, consistently improves performance across benchmarks, and approaches state-of-the-art results when combined with TD3+BC.

Offline reinforcement learning (RL), which optimizes policies using a previously collected static dataset, is an important branch of RL. A popular and promising approach is to regularize actor-critic methods with behavior cloning (BC), which quickly yields realistic policies and mitigates bias from out-of-distribution actions. However, it can impose an often-overlooked performance ceiling: when dataset actions are suboptimal, indiscriminate imitation structurally prevents the actor from fully exploiting better actions suggested by the value function, especially in later training when imitation is already dominant. We formally analyze this limitation by investigating convergence properties of BC-regularized actor-critic optimization and verify it on a controlled continuous bandit task. To break this ceiling, we propose proximal action replacement (PAR), an easy-to-use plug-and-play training sample replacer. PAR substitutes suboptimal dataset actions with better actions generated by a stable target policy, guided by the action-value function's local ascent direction and bounded by value uncertainty to ensure training stability. PAR is compatible with multiple BC regularization paradigms. Extensive experiments across offline RL benchmarks show that PAR consistently improves performance and approaches state-of-the-art results when simply combined with the basic TD3+BC.
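The ceiling and the replacement idea described above can be illustrated with a minimal sketch on a one-dimensional continuous bandit. This is not the paper's algorithm. The quadratic action-value, the constant `A_STAR`, the helper names (`bc_regularized_action`, `replace_actions`, `run`), and the fixed `radius` that stands in for the value-uncertainty bound are all assumptions made for illustration. The actor objective is taken to be the TD3+BC-style form `lam * Q(a) - (a - a_i)^2`, and the "target policy" is approximated by the closed-form maximizer of that objective.

```python
import numpy as np

A_STAR = 1.0  # assumed location of the best action in the toy bandit


def q_value(a):
    """Toy action-value: a concave quadratic that peaks at A_STAR."""
    return -(a - A_STAR) ** 2


def q_grad(a):
    """Gradient of the toy action-value with respect to the action."""
    return -2.0 * (a - A_STAR)


def bc_regularized_action(data_actions, lam):
    """Maximizer of mean_i[lam * Q(a) - (a - a_i)^2] for the quadratic Q above.

    Setting the derivative to zero gives a = (lam * A_STAR + mean(a_i)) / (lam + 1).
    """
    return (lam * A_STAR + np.mean(data_actions)) / (lam + 1.0)


def replace_actions(data_actions, target_action, radius):
    """Hypothetical replacement rule (an assumption, not the paper's exact rule).

    Each dataset action takes a step of at most `radius` toward the target
    policy's action. The step is kept only if it follows the local ascent
    direction of Q and the candidate has a strictly higher value.
    """
    step = np.clip(target_action - data_actions, -radius, radius)
    candidate = data_actions + step
    ascent = q_grad(data_actions) * step > 0.0
    better = q_value(candidate) > q_value(data_actions)
    return np.where(ascent & better, candidate, data_actions)


def run(rounds=50, lam=1.0, radius=0.1):
    """Compare the BC-only fixed point with the result after repeated replacement."""
    data_actions = np.linspace(-0.2, 0.2, 5)  # suboptimal actions centered at 0
    ceiling = bc_regularized_action(data_actions, lam)
    actions = data_actions.copy()
    for _ in range(rounds):
        target = bc_regularized_action(actions, lam)
        actions = replace_actions(actions, target, radius)
    return ceiling, bc_regularized_action(actions, lam)


ceiling, with_replacement = run()
print(f"BC-only fixed point: {ceiling:.3f}")
print(f"with replacement:    {with_replacement:.3f}")
```

In this toy setting the BC-only solution sits between the dataset mean and the best action, no matter how long training runs. Once the imitation targets themselves are moved along the ascent direction, the fixed point shifts toward `A_STAR`. A real implementation would use a learned critic, a target actor network, and an uncertainty estimate in place of the constant `radius`.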
