Switching the Loss Reduces the Cost in Batch (Offline) Reinforcement Learning
This work is relevant to researchers and practitioners concerned with sample efficiency in reinforcement learning, though the contribution is incremental: it modifies an existing method, fitted Q-iteration, by changing its loss function.
The paper tackles sample efficiency in batch reinforcement learning by proposing FQI-log, which trains fitted Q-iteration with log-loss instead of squared loss. It shows that the number of samples required scales with the optimal policy's accumulated cost, which is zero in goal-reaching tasks where acting optimally incurs no cost, and it verifies empirically that FQI-log needs fewer samples than squared-loss FQI on such tasks.
We propose training fitted Q-iteration with log-loss (FQI-log) for batch reinforcement learning (RL). We show that the number of samples needed to learn a near-optimal policy with FQI-log scales with the accumulated cost of the optimal policy, which is zero in problems where acting optimally achieves the goal and incurs no cost. In doing so, we provide a general framework for proving small-cost bounds, i.e. bounds that scale with the optimal achievable cost, in batch RL. Moreover, we empirically verify that FQI-log uses fewer samples than FQI trained with squared loss on problems where the optimal policy reliably achieves the goal.
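To make the loss switch concrete, the following is a minimal sketch of the two regression losses that one fitted Q-iteration step could minimize. It is not the authors' implementation. It assumes that costs and cost-to-go values are normalized to lie in [0, 1], so that log-loss (binary cross-entropy with real-valued targets) is well defined; that normalization, the helper names `squared_loss`, `log_loss`, and `bellman_targets`, and the toy batch are all hypothetical and chosen for illustration only.

```python
import numpy as np

def squared_loss(pred, target):
    # Standard FQI regression loss: mean squared error against the targets.
    return np.mean((pred - target) ** 2)

def log_loss(pred, target, eps=1e-12):
    # Log-loss with real-valued targets in [0, 1] (assumed normalization).
    # Predictions are clipped away from 0 and 1 to keep the logs finite.
    p = np.clip(pred, eps, 1.0 - eps)
    return np.mean(-(target * np.log(p) + (1.0 - target) * np.log(1.0 - p)))

def bellman_targets(q_next, costs, done):
    # Cost-minimizing backup: immediate cost plus the smallest predicted
    # cost-to-go at the next state, with no bootstrapping at terminal states.
    # q_next has shape (n_samples, n_actions) and comes from the previous iterate.
    # The final clip enforces the assumed [0, 1] range.
    return np.clip(costs + (1.0 - done) * q_next.min(axis=1), 0.0, 1.0)

# Hypothetical batch of three transitions.
costs = np.array([0.0, 0.0, 1.0])
done = np.array([1.0, 0.0, 1.0])
q_next = np.array([[0.2, 0.5], [0.0, 0.3], [0.9, 0.1]])

targets = bellman_targets(q_next, costs, done)
pred = np.array([0.1, 0.1, 0.9])  # hypothetical current predictions

sq = squared_loss(pred, targets)
ll = log_loss(pred, targets)
```

In an FQI loop, either loss would be minimized over the function class at each iteration, with `targets` recomputed from the previous iterate. The only difference between FQI and FQI-log in this sketch is which of the two loss functions is passed to the regression step.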