Mode-Dependent Rectification for Stable PPO Training
This paper addresses a stability problem in visual reinforcement learning and is aimed at researchers and practitioners who use on-policy methods with mode-dependent components.
The paper tackles instability in Proximal Policy Optimization (PPO) caused by mode-dependent layers such as Batch Normalization, and proposes Mode-Dependent Rectification (MDR) to stabilize training without architectural changes. Experiments show that MDR consistently improves stability and performance across procedurally generated games and real-world patch-localization tasks.
Mode-dependent architectural components (layers that behave differently during training and evaluation, such as Batch Normalization or dropout) are commonly used in visual reinforcement learning but can destabilize on-policy optimization. We show that in Proximal Policy Optimization (PPO), discrepancies between training and evaluation behavior induced by Batch Normalization lead to policy mismatch, distributional drift, and reward collapse. We propose Mode-Dependent Rectification (MDR), a lightweight dual-phase training procedure that stabilizes PPO under mode-dependent layers without architectural changes. Experiments across procedurally generated games and real-world patch-localization tasks demonstrate that MDR consistently improves stability and performance, and extends naturally to other mode-dependent layers.
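The train/evaluation discrepancy described above can be illustrated with a minimal sketch. The pure-Python code below is not the paper's implementation and does not implement MDR. It assumes a hypothetical scalar batch-norm layer (`BatchNorm1D`), a hypothetical two-action sigmoid policy (`action_prob`), and an assumed setup in which actions are scored in evaluation mode during rollout and in training mode during the update. Under those assumptions, the PPO importance ratio moves away from 1 even though no parameter has changed.

```python
import math


class BatchNorm1D:
    """Scalar-feature batch norm with running statistics (illustrative only)."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.running_mean = 0.0
        self.running_var = 1.0
        self.momentum = momentum
        self.eps = eps
        self.training = True

    def __call__(self, xs):
        if self.training:
            # Training mode: normalize with the statistics of the current batch.
            mean = sum(xs) / len(xs)
            var = sum((x - mean) ** 2 for x in xs) / len(xs)
            # Update the running estimates with an exponential moving average.
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            # Evaluation mode: normalize with the stored running statistics.
            mean, var = self.running_mean, self.running_var
        return [(x - mean) / math.sqrt(var + self.eps) for x in xs]


def action_prob(logit):
    """Probability of action 1 under a two-action policy (hypothetical)."""
    return 1.0 / (1.0 + math.exp(-logit))


def mode_mismatch_ratios(observations):
    """Importance ratios pi_train(a|s) / pi_eval(a|s) with unchanged parameters."""
    bn = BatchNorm1D()

    # Assumed rollout phase: the policy is queried in evaluation mode.
    bn.training = False
    old_probs = [action_prob(z) for z in bn(observations)]

    # Assumed update phase: the same observations are scored in training mode.
    bn.training = True
    new_probs = [action_prob(z) for z in bn(observations)]

    return [new / old for new, old in zip(new_probs, old_probs)]


ratios = mode_mismatch_ratios([2.0, 3.0, 4.0, 5.0])
print(ratios)
```

If both passes used the same mode, every ratio in this sketch would be exactly 1, since the parameters are identical. The deviation here comes only from the change in normalization statistics, so a clipped PPO objective would be reacting to the mode switch rather than to a genuine policy change.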