On the Limits of Tabular Hardness Metrics for Deep RL: A Study with the Pharos Benchmark
This work addresses the need for better evaluation methods in deep RL and highlights a fundamental gap in existing hardness metrics. The contribution is incremental in that it builds on tabular RL theory, but it reveals new challenges specific to non-tabular settings.
The paper tackles the problem of principled evaluation in deep reinforcement learning by investigating whether tabular hardness metrics can guide non-tabular benchmarking. It finds that representation hardness dominates difficulty and that tabular metrics alone are poor predictors of deep RL agent performance.
Principled evaluation is critical for progress in deep reinforcement learning (RL), yet it lags behind the theory-driven benchmarks of tabular RL. While tabular settings benefit from well-understood hardness measures such as MDP diameter and suboptimality gaps, deep RL benchmarks are often chosen based on intuition and popularity. This raises a central question: can tabular hardness metrics be adapted to guide non-tabular benchmarking? We investigate this question and reveal a fundamental gap. Our primary contribution is to demonstrate that the difficulty of non-tabular environments is dominated by a factor that tabular metrics ignore: representation hardness. The same underlying MDP can pose vastly different challenges depending on whether the agent receives state vectors or pixel-based observations. To enable this analysis, we introduce \texttt{pharos}, a new open-source library for principled RL benchmarking that allows systematic control over both environment structure and agent representations. Our extensive case study using \texttt{pharos} shows that tabular metrics offer some insight but are, on their own, poor predictors of deep RL agent performance. This work highlights the urgent need for new, representation-aware hardness measures and positions \texttt{pharos} as a key tool for developing them.
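To make the two tabular hardness measures named above concrete, the following is a minimal sketch that computes them for a toy three-state chain MDP. It uses plain NumPy and does not use the \texttt{pharos} API. The chain environment, the function names (`make_chain`, `optimal_q`, `diameter`), the discount factor, and the iteration counts are all illustrative assumptions. The definitions are the standard textbook ones: the diameter is the largest, over ordered state pairs, of the minimal expected number of steps to reach one state from another, and the suboptimality gap of a state-action pair is $V^*(s) - Q^*(s, a)$. The paper's exact variants may differ.

```python
import numpy as np

def make_chain(n=3):
    """Hypothetical deterministic chain: action 0 moves left, action 1 moves right."""
    P = np.zeros((n, 2, n))          # P[s, a, s'] = transition probability
    R = np.zeros((n, 2))             # R[s, a] = expected reward
    for s in range(n):
        P[s, 0, max(s - 1, 0)] = 1.0
        P[s, 1, min(s + 1, n - 1)] = 1.0
    R[n - 1, 1] = 1.0                # reward for pushing right at the last state
    return P, R

def optimal_q(P, R, gamma=0.9, iters=2000):
    """Value iteration on Q; returns an (S, A) array approximating Q*."""
    Q = np.zeros(R.shape)
    for _ in range(iters):
        Q = R + gamma * (P @ Q.max(axis=1))
    return Q

def diameter(P, iters=2000):
    """Max over (start, goal) pairs of the minimal expected hitting time."""
    n_states = P.shape[0]
    worst = 0.0
    for goal in range(n_states):
        h = np.zeros(n_states)       # h[s] = expected steps from s to goal
        for _ in range(iters):
            h = 1.0 + (P @ h).min(axis=1)
            h[goal] = 0.0            # the goal is reached in zero steps
        worst = max(worst, float(h.max()))
    return worst

P, R = make_chain()
Q = optimal_q(P, R)
gaps = Q.max(axis=1, keepdims=True) - Q   # gaps[s, a] = V*(s) - Q*(s, a)
D = diameter(P)                           # 2.0 for the three-state chain
```

Both quantities depend only on the transition and reward structure `(P, R)`. Neither changes if the same chain is presented to the agent as state vectors or as rendered images, which is why such measures cannot, by construction, capture representation hardness.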