LG | Jun 1, 2023

Improving Offline RL by Blending Heuristics

arXiv:2306.00321v2 | 12 citations | h-index: 24
Originality: Incremental advance
AI Analysis

The paper addresses performance limitations of offline RL algorithms used in applications such as robotics and autonomous systems. It represents an incremental improvement.

The paper proposes Heuristic Blending (HUBL), a technique for improving offline reinforcement learning (RL) that modifies Bellman operators to blend bootstrapped values with heuristic Monte-Carlo returns. It reports an average 9% improvement in policy quality across 27 datasets from the D4RL and Meta-World benchmarks.

We propose Heuristic Blending (HUBL), a simple performance-improving technique for a broad class of offline RL algorithms based on value bootstrapping. HUBL modifies the Bellman operators used in these algorithms, partially replacing the bootstrapped values with heuristic ones that are estimated with Monte-Carlo returns. For trajectories with higher returns, HUBL relies more on the heuristic values and less on bootstrapping; otherwise, it leans more heavily on bootstrapping. HUBL is very easy to combine with many existing offline RL implementations by relabeling the offline datasets with adjusted rewards and discount factors. We derive a theory that explains HUBL's effect on offline RL as reducing offline RL's complexity and thus increasing its finite-sample performance. Furthermore, we empirically demonstrate that HUBL consistently improves the policy quality of four state-of-the-art bootstrapping-based offline RL algorithms (ATAC, CQL, TD3+BC, and IQL), by 9% on average over 27 datasets of the D4RL and Meta-World benchmarks.
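The relabeling step described in the abstract can be illustrated with a minimal sketch. It assumes the blended target takes the form r + γ[(1 − λ)·V(s') + λ·h(s')], which can be rewritten as an adjusted reward r + γ·λ·h(s') paired with an adjusted discount γ·(1 − λ). The function names (`returns_to_go`, `blend_weights`, `hubl_relabel`), the linear min-max rule for choosing λ, and the treatment of the last step of each trajectory as terminal are all assumptions made for illustration, not details confirmed by the text above.

```python
import numpy as np


def returns_to_go(rewards, gamma):
    """Discounted Monte-Carlo return from each step to the end of the trajectory."""
    rtg = np.zeros(len(rewards))
    running = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg


def blend_weights(trajectory_returns, alpha):
    """Hypothetical rule: scale lambda linearly with trajectory return, in [0, alpha].

    Higher-return trajectories get a larger lambda (more weight on the heuristic).
    """
    g = np.asarray(trajectory_returns, dtype=float)
    span = g.max() - g.min()
    if span == 0.0:
        # All trajectories are equally good; use the maximum weight everywhere.
        return np.full(len(g), alpha)
    return alpha * (g - g.min()) / span


def hubl_relabel(rewards, gamma, lam):
    """Relabel one trajectory with adjusted rewards and discounts.

    Assumed form: r_new = r + gamma * lam * h(s'), gamma_new = gamma * (1 - lam),
    where h(s') is the Monte-Carlo return-to-go from the next state.
    """
    rewards = np.asarray(rewards, dtype=float)
    h = returns_to_go(rewards, gamma)
    # Heuristic at the next state; assumed to be 0 after the final step.
    h_next = np.append(h[1:], 0.0)
    new_rewards = rewards + gamma * lam * h_next
    new_discounts = np.full(len(rewards), gamma * (1.0 - lam))
    return new_rewards, new_discounts
```

Under these assumptions, λ = 0 leaves the data unchanged (pure bootstrapping), while λ = 1 sets the discount to zero and turns each relabeled reward into the Monte-Carlo return from that step, so no bootstrapping remains. The relabeled rewards and discounts could then be passed to an existing bootstrapping-based learner in place of the originals.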

