AI · LG · Jun 12, 2024

Improving Policy Optimization via $\varepsilon$-Retrain

arXiv:2406.08315v2 · 7 citations
Originality: Incremental advance
AI Analysis

This work addresses policy optimization in reinforcement learning, specifically how to encourage agents to satisfy a behavioral preference. The contribution appears incremental.

The paper introduces ε-retrain, an exploration strategy that encourages a behavioral preference while retaining monotonic improvement guarantees during policy optimization. The authors report significant performance and sample-efficiency improvements across locomotion, power network, and navigation tasks.

We present $\varepsilon$-retrain, an exploration strategy encouraging a behavioral preference while optimizing policies with monotonic improvement guarantees. To this end, we introduce an iterative procedure for collecting retrain areas -- parts of the state space where an agent did not satisfy the behavioral preference. Our method switches between the typical uniform restart state distribution and the retrain areas using a decaying factor $\varepsilon$, allowing agents to retrain on situations where they violated the preference. We also employ formal verification of neural networks to provably quantify the degree to which agents adhere to these behavioral preferences. Experiments over hundreds of seeds across locomotion, power network, and navigation tasks show that our method yields agents that exhibit significant performance and sample efficiency improvements.
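The restart scheme described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `RetrainRestartSampler`, the representation of retrain areas as axis-aligned boxes of a fixed `radius` around a violating state, and the multiplicative decay schedule are all assumptions made for illustration. The abstract states only that the method mixes the uniform restart distribution with the retrain areas using a decaying factor $\varepsilon$.

```python
import random


class RetrainRestartSampler:
    """Hypothetical sketch of an epsilon-mixed restart distribution.

    With probability eps the episode restarts inside a stored retrain area;
    otherwise it restarts uniformly over the whole state space.
    """

    def __init__(self, low, high, eps_start=0.5, eps_min=0.0,
                 decay=0.99, radius=0.05, seed=0):
        self.low, self.high = list(low), list(high)  # state-space bounds
        self.eps = eps_start
        self.eps_min = eps_min
        self.decay = decay      # assumed multiplicative decay per update
        self.radius = radius    # assumed half-width of a retrain area
        self.areas = []         # list of (lo, hi) boxes
        self.rng = random.Random(seed)

    def add_violation(self, state):
        """Store a box around a state where the preference was violated."""
        lo = [max(l, s - self.radius) for s, l in zip(state, self.low)]
        hi = [min(h, s + self.radius) for s, h in zip(state, self.high)]
        self.areas.append((lo, hi))

    def sample(self):
        """Return (restart_state, used_retrain_area)."""
        use_retrain = bool(self.areas) and self.rng.random() < self.eps
        if use_retrain:
            lo, hi = self.rng.choice(self.areas)
        else:
            lo, hi = self.low, self.high
        state = [self.rng.uniform(a, b) for a, b in zip(lo, hi)]
        return state, use_retrain

    def decay_eps(self):
        """Shrink eps so later training relies on uniform restarts."""
        self.eps = max(self.eps_min, self.eps * self.decay)
```

In a training loop, one would call `add_violation` whenever an episode breaks the preference, `sample` at each episode reset, and `decay_eps` after each policy update. As $\varepsilon$ decays toward its minimum, the restart distribution returns to the usual uniform one.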
