AI · LG · Dec 12, 2025

Reliable Policy Iteration: Performance Robustness Across Architecture and Environment Perturbations

arXiv:2512.12088v1 · 1 citation · h-index: 16
Originality: Incremental advance
AI Analysis

This work addresses reliability issues that affect deep RL practitioners, such as sample inefficiency and hyperparameter sensitivity, but the advance is incremental because it builds on the authors' prior work.

The paper examines performance robustness in deep reinforcement learning by evaluating Reliable Policy Iteration (RPI) on the CartPole and Inverted Pendulum tasks. Relative to baselines such as DQN and PPO, RPI reaches near-optimal performance early in training and sustains it.

In recent work, we proposed Reliable Policy Iteration (RPI), which restores policy iteration's monotonicity-of-value-estimates property to the function approximation setting. Here, we assess the robustness of RPI's empirical performance on two classical control tasks -- CartPole and Inverted Pendulum -- under changes to neural network and environmental parameters. Relative to DQN, Double DQN, DDPG, TD3, and PPO, RPI reaches near-optimal performance early and sustains this policy as training proceeds. Because deep RL methods are often hampered by sample inefficiency, training instability, and hyperparameter sensitivity, our results highlight RPI's promise as a more reliable alternative.
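
The monotonicity property mentioned in the abstract can be illustrated in the exact tabular setting, where classical policy iteration never lowers the value of any state from one iteration to the next. The sketch below is not RPI and is not taken from the paper; it is plain tabular policy iteration on a hypothetical 4-state chain MDP (the names `step`, `evaluate`, `improve`, and `policy_iteration`, the reward scheme, and the discount factor are all assumptions made for illustration).

```python
# Tabular policy iteration on a hypothetical 4-state chain MDP.
# Illustration only: this is classical policy iteration, not RPI.
N_STATES, ACTIONS, GAMMA = 4, (-1, +1), 0.9

def step(s, a):
    """Deterministic move left/right, clipped to the chain.
    Reward 1 for landing on the last state, 0 otherwise."""
    s2 = min(max(s + a, 0), N_STATES - 1)
    return s2, (1.0 if s2 == N_STATES - 1 else 0.0)

def evaluate(policy, tol=1e-10):
    """Iterative policy evaluation: sweep Bellman backups until convergence."""
    v = [0.0] * N_STATES
    while True:
        delta = 0.0
        for s in range(N_STATES):
            s2, r = step(s, policy[s])
            new = r + GAMMA * v[s2]
            delta = max(delta, abs(new - v[s]))
            v[s] = new
        if delta < tol:
            return v

def improve(v):
    """Greedy policy improvement with respect to the value estimate v."""
    def q(s, a):
        s2, r = step(s, a)
        return r + GAMMA * v[s2]
    return [max(ACTIONS, key=lambda a: q(s, a)) for s in range(N_STATES)]

def policy_iteration():
    """Alternate evaluation and improvement; record each policy's values."""
    policy = [-1] * N_STATES  # start from the "always move left" policy
    history = []
    while True:
        v = evaluate(policy)
        history.append(v)
        new_policy = improve(v)
        if new_policy == policy:
            return policy, history
        policy = new_policy

if __name__ == "__main__":
    policy, history = policy_iteration()
    for i, v in enumerate(history):
        print(i, [round(x, 3) for x in v])
```

Each row printed is the value function of one successive policy; in this exact setting every entry is at least as large as the one above it. With function approximation this guarantee generally no longer holds for standard methods, which is the gap the abstract says RPI is designed to close.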

