Behavior Constraining in Weight Space for Offline Reinforcement Learning
The paper addresses the challenge of policy regularization in offline RL, but appears incremental: it changes where the constraint is applied rather than introducing a new paradigm.
The paper tackles the problem of learning policies from a fixed dataset in offline reinforcement learning by proposing a new algorithm that constrains the policy directly in weight space, demonstrating its effectiveness in experiments.
In offline reinforcement learning, a policy needs to be learned from a single pre-collected dataset. Typically, policies are thus regularized during training to behave similarly to the data-generating policy, by adding a penalty based on a divergence between the action distributions of the generating and trained policies. We propose a new algorithm that instead constrains the policy directly in its weight space, and demonstrate its effectiveness in experiments.
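To make the contrast between the two kinds of regularizer concrete, here is a minimal sketch, not the paper's algorithm. The abstract does not state the form of the weight-space constraint, so the squared Euclidean distance to the parameters of a behavior-cloned policy used below is an assumption, as are the linear policy, the function names, and the toy data. The action-space term uses the mean squared gap between actions, which for Gaussian policies with a shared fixed variance is proportional to the KL divergence.

```python
import numpy as np

def action_space_penalty(theta, theta_b, states):
    """Mean squared gap between the actions of the trained policy (theta)
    and the behavior policy (theta_b), evaluated on dataset states.
    Both policies are assumed linear: action = theta @ state."""
    diff = states @ theta.T - states @ theta_b.T
    return float(np.mean(np.sum(diff ** 2, axis=1)))

def weight_space_penalty(theta, theta_b):
    """Squared Euclidean distance between the two parameter sets.
    Assumed form of the constraint; it needs no states or actions."""
    return float(np.sum((theta - theta_b) ** 2))

# Toy data: 256 states of dimension 4, actions of dimension 2 (hypothetical sizes).
rng = np.random.default_rng(0)
states = rng.normal(size=(256, 4))
theta_b = rng.normal(size=(2, 4))   # stands in for behavior-cloned weights
theta = theta_b + 0.1               # trained policy, slightly perturbed

print("action-space penalty:", action_space_penalty(theta, theta_b, states))
print("weight-space penalty:", weight_space_penalty(theta, theta_b))
```

In either case the penalty would be added, scaled by a coefficient, to the policy objective. The practical difference visible in the sketch is that the action-space term depends on which states it is evaluated on, while the weight-space term depends only on the two parameter sets.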