AI, LG | Dec 12, 2025

Reliable Policy Iteration: Performance Robustness Across Architecture and Environment Perturbations

S. R. Eshwar, Aniruddha Mukherjee, Kintan Saha, Krishna Agarwal, Gugan Thoppe, Aditya Gopalan, Gal Dalal

arXiv:2512.12088v1 | 5.81 citations | h-index: 16

Originality: Incremental advance

AI Analysis

The work addresses reliability issues that deep RL practitioners face, such as sample inefficiency and hyperparameter sensitivity. It is rated an incremental advance because it builds on the authors' prior work introducing RPI.

The paper studies performance robustness in deep reinforcement learning by evaluating Reliable Policy Iteration (RPI) on the CartPole and Inverted Pendulum tasks under changes to neural network and environment parameters. It reports that RPI reaches near-optimal performance early and sustains it, in contrast to DQN, Double DQN, DDPG, TD3, and PPO.

In recent work, we proposed Reliable Policy Iteration (RPI), which restores policy iteration's monotonicity-of-value-estimates property in the function approximation setting. Here, we assess the robustness of RPI's empirical performance on two classical control tasks -- CartPole and Inverted Pendulum -- under changes to neural network and environmental parameters. Relative to DQN, Double DQN, DDPG, TD3, and PPO, RPI reaches near-optimal performance early and sustains this policy as training proceeds. Because deep RL methods are often hampered by sample inefficiency, training instability, and hyperparameter sensitivity, our results highlight RPI's promise as a more reliable alternative.
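The monotonicity property mentioned in the abstract comes from classical policy iteration: with exact policy evaluation, each greedy improvement step yields a value vector that is no smaller in any state than the previous one. The sketch below illustrates only that tabular property; it is not RPI, whose function-approximation construction is not described on this page. The random MDP, the function name `policy_iteration`, and all parameter values are illustrative assumptions.

```python
import numpy as np


def policy_iteration(P, R, gamma=0.9, max_iters=100):
    """Classical tabular policy iteration (not RPI).

    P: transition probabilities, shape (S, A, S).
    R: expected rewards, shape (S, A).
    Returns the final policy and the value vector of every iterate.
    """
    n_states = P.shape[0]
    idx = np.arange(n_states)
    policy = np.zeros(n_states, dtype=int)
    values = []
    for _ in range(max_iters):
        # Exact policy evaluation: solve (I - gamma * P_pi) v = r_pi.
        P_pi = P[idx, policy]
        r_pi = R[idx, policy]
        v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        values.append(v)
        # Greedy improvement with respect to the evaluated values.
        q = R + gamma * P @ v
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy
    return policy, values


# Hypothetical random MDP with 5 states and 3 actions (assumed for illustration).
rng = np.random.default_rng(0)
P = rng.random((5, 3, 5))
P /= P.sum(axis=2, keepdims=True)  # normalize each row into a distribution
R = rng.random((5, 3))
gamma = 0.9

policy, values = policy_iteration(P, R, gamma=gamma)

# Monotonicity: every iterate dominates the previous one in every state.
for prev, nxt in zip(values, values[1:]):
    assert np.all(nxt >= prev - 1e-9)
```

With function approximation, evaluation is no longer exact and this guarantee generally fails, which is the gap the abstract says RPI is designed to close.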

