LGNov 10, 2022

When is Realizability Sufficient for Off-Policy Reinforcement Learning?

arXiv:2211.05311v217 citationsh-index: 13
Originality Incremental advance
AI Analysis

This work addresses the challenge of making off-policy reinforcement learning more practical for AI/ML researchers by reducing structural assumptions, though it is incremental as it builds on existing theory without introducing a new paradigm.

The paper tackles the problem of off-policy reinforcement learning by relaxing the strong assumption of Bellman completeness to only realizability, establishing finite-sample guarantees that eliminate the inherent Bellman error term and depend on factors like metric entropy, concentrability, and a new measure of Bellman completeness violation.

Model-free algorithms for reinforcement learning typically require a condition called Bellman completeness in order to successfully operate off-policy with function approximation, unless additional conditions are met. However, Bellman completeness is a requirement that is much stronger than realizability and that is deemed to be too strong to hold in practice. In this work, we relax this structural assumption and analyze the statistical complexity of off-policy reinforcement learning when only realizability holds for the prescribed function class. We establish finite-sample guarantees for off-policy reinforcement learning that are free of the approximation error term known as inherent Bellman error, and that depend on the interplay of three factors. The first two are well known: they are the metric entropy of the function class and the concentrability coefficient that represents the cost of learning off-policy. The third factor is new, and it measures the violation of Bellman completeness, namely the mis-alignment between the chosen function class and its image through the Bellman operator. In essence, these error bounds establish that off-policy reinforcement learning remains statistically viable even in absence of Bellman completeness, and characterize the intermediate situation between the favorable Bellman complete setting and the worst-case scenario where exponential lower bounds are in force. Our analysis directly applies to the solution found by temporal difference algorithms when they converge.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes