Bellman Calibration for V-Learning in Offline Reinforcement Learning
This work addresses the challenge of reliable value estimation in offline reinforcement learning and is most relevant to practitioners who need calibrated predictions without strong model assumptions. The contribution appears incremental, since it adapts classical calibration methods to a dynamic setting.
The paper tackles the problem of calibrating off-policy value predictions in infinite-horizon Markov decision processes. It introduces Iterated Bellman Calibration, a model-agnostic post-hoc procedure that ensures states with similar predicted long-term returns have one-step returns consistent with the Bellman equation under the target policy. Finite-sample guarantees are provided for both calibration and prediction, without requiring Bellman completeness or realizability.
We introduce Iterated Bellman Calibration, a simple, model-agnostic, post-hoc procedure for calibrating off-policy value predictions in infinite-horizon Markov decision processes. Bellman calibration requires that states with similar predicted long-term returns exhibit one-step returns consistent with the Bellman equation under the target policy. We adapt classical histogram and isotonic calibration to the dynamic, counterfactual setting by repeatedly regressing fitted Bellman targets onto a model's predictions, using a doubly robust pseudo-outcome to handle off-policy data. This yields a one-dimensional fitted value iteration scheme that can be applied to any value estimator. Our analysis provides finite-sample guarantees for both calibration and prediction under weak assumptions, and critically, without requiring Bellman completeness or realizability.
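To make the "one-dimensional fitted value iteration" idea concrete, the following is a minimal sketch of the histogram variant, not the authors' implementation. It assumes each transition comes with the base model's prediction at the current state (`v_s`), its prediction at the next state (`v_next`), a reward, and an importance weight for the target policy. As a simplification, it uses a plain importance-weighted Bellman target in place of the doubly robust pseudo-outcome described in the abstract. The function names, arguments, and quantile binning are all hypothetical choices made for illustration.

```python
import numpy as np


def iterated_bellman_histogram(v_s, v_next, rewards, weights,
                               gamma=0.9, n_bins=10, n_iters=50):
    """Sketch of histogram-style iterated Bellman calibration.

    Assumption: targets are importance-weighted (weights = pi/behavior ratio),
    a simplification of the doubly robust pseudo-outcome in the paper.
    Returns interior bin edges and one calibrated value per bin.
    """
    v_s = np.asarray(v_s, dtype=float)
    v_next = np.asarray(v_next, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    weights = np.asarray(weights, dtype=float)

    # Interior quantile edges computed from current-state predictions.
    edges = np.quantile(v_s, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    bin_s = np.searchsorted(edges, v_s, side="right")
    bin_next = np.searchsorted(edges, v_next, side="right")

    # Start from the mean raw prediction in each bin (0 for empty bins).
    counts = np.bincount(bin_s, minlength=n_bins)
    safe_counts = np.maximum(counts, 1)
    theta = np.bincount(bin_s, weights=v_s, minlength=n_bins) / safe_counts

    for _ in range(n_iters):
        # Fitted Bellman target built from the current calibrated values.
        targets = weights * (rewards + gamma * theta[bin_next])
        # Regress targets onto the prediction bins (a bin-wise mean).
        sums = np.bincount(bin_s, weights=targets, minlength=n_bins)
        theta = np.where(counts > 0, sums / safe_counts, theta)

    return edges, theta


def apply_calibrator(edges, theta, v):
    """Map raw predictions to their bin's calibrated value."""
    idx = np.searchsorted(edges, np.asarray(v, dtype=float), side="right")
    return theta[idx]
```

Because the regression is only over the scalar prediction, each iteration is a bin-wise average rather than a fit over the full state space. An isotonic variant would, under the same assumptions, replace the bin-wise mean with a monotone regression of the targets on `v_s`.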