Improving Policy Optimization via $\varepsilon$-Retrain
This work addresses policy optimization in reinforcement learning, focusing on encouraging agents to satisfy a behavioral preference, though the contribution appears incremental.
The paper introduces $\varepsilon$-retrain, an exploration strategy that encourages a behavioral preference while preserving monotonic improvement guarantees during policy optimization. The authors report significant performance and sample efficiency improvements across locomotion, power network, and navigation tasks.
We present $\varepsilon$-retrain, an exploration strategy encouraging a behavioral preference while optimizing policies with monotonic improvement guarantees. To this end, we introduce an iterative procedure for collecting retrain areas -- parts of the state space where an agent did not satisfy the behavioral preference. Our method switches between the typical uniform restart state distribution and the retrain areas using a decaying factor $\varepsilon$, allowing agents to retrain on situations where they violated the preference. We also employ formal verification of neural networks to provably quantify the degree to which agents adhere to these behavioral preferences. Experiments over hundreds of seeds across locomotion, power network, and navigation tasks show that our method yields agents that exhibit significant performance and sample efficiency improvements.
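The switching rule described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the representation of a retrain area as an axis-aligned box around a violating state, the exponential decay schedule with a floor, and all function and parameter names (`make_retrain_area`, `decayed_epsilon`, `sample_restart_state`, `radius`, `eps_min`) are assumptions made only for illustration.

```python
import random
from typing import List, Sequence, Tuple

# Hypothetical representation: per-dimension (low, high) bounds of a region.
Box = List[Tuple[float, float]]


def sample_uniform(box: Box, rng: random.Random) -> List[float]:
    """Draw one state uniformly from an axis-aligned box."""
    return [rng.uniform(lo, hi) for lo, hi in box]


def make_retrain_area(state: Sequence[float], radius: float, bounds: Box) -> Box:
    """Assumed construction: a box of half-width `radius` around a state
    where the preference was violated, clipped to the state-space bounds."""
    return [(max(lo, s - radius), min(hi, s + radius))
            for s, (lo, hi) in zip(state, bounds)]


def decayed_epsilon(eps0: float, decay: float, eps_min: float, iteration: int) -> float:
    """Assumed schedule: exponential decay of epsilon with a lower floor."""
    return max(eps_min, eps0 * decay ** iteration)


def sample_restart_state(bounds: Box, retrain_areas: List[Box],
                         eps: float, rng: random.Random) -> List[float]:
    """With probability eps restart inside a collected retrain area;
    otherwise fall back to the uniform restart distribution."""
    if retrain_areas and rng.random() < eps:
        return sample_uniform(rng.choice(retrain_areas), rng)
    return sample_uniform(bounds, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    bounds = [(-1.0, 1.0), (-1.0, 1.0)]
    # Pretend the agent violated the preference at this state.
    areas = [make_retrain_area([0.9, 0.0], 0.2, bounds)]
    for it in range(3):
        eps = decayed_epsilon(eps0=0.8, decay=0.5, eps_min=0.05, iteration=it)
        print(it, eps, sample_restart_state(bounds, areas, eps, rng))
```

In this sketch, an empty list of retrain areas reduces the procedure to the usual uniform restart distribution, and a decaying $\varepsilon$ gradually shifts restarts back toward that distribution as training proceeds.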