LG | ML | Jun 6, 2022

Sample Complexity of Nonparametric Off-Policy Evaluation on Low-Dimensional Manifolds using Deep Networks

arXiv:2206.02887v2 | 17 citations | h-index: 36
Originality: Incremental advance
AI Analysis

This paper addresses sample efficiency in reinforcement learning for practitioners dealing with high-dimensional data, though the advance is incremental because it builds on existing fitted Q-evaluation methods.

The paper tackles the off-policy evaluation problem in reinforcement learning by analyzing deep fitted Q-evaluation with deep convolutional neural networks, showing that leveraging low-dimensional manifold structures yields a sample-efficient estimator with error bounds dependent on intrinsic dimension, smoothness, and a function class-restricted divergence.

We consider the off-policy evaluation problem of reinforcement learning using deep convolutional neural networks. We analyze the deep fitted Q-evaluation method for estimating the expected cumulative reward of a target policy, when the data are generated from an unknown behavior policy. We show that, by choosing network size appropriately, one can leverage any low-dimensional manifold structure in the Markov decision process and obtain a sample-efficient estimator without suffering from the curse of high data ambient dimensionality. Specifically, we establish a sharp error bound for fitted Q-evaluation, which depends on the intrinsic dimension of the state-action space, the smoothness of the Bellman operator, and a function class-restricted $\chi^2$-divergence. It is noteworthy that the restricted $\chi^2$-divergence measures the behavior and target policies' mismatch in the function space, which can be small even if the two policies are not close to each other in their tabular forms. We also develop a novel approximation result for convolutional neural networks in Q-function estimation. Numerical experiments are provided to support our theoretical analysis.
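
To make the fitted Q-evaluation loop mentioned in the abstract concrete, here is a minimal sketch under stated assumptions. It is not the paper's method: the paper uses deep convolutional networks on manifold-structured data, whereas this sketch swaps in one-hot (tabular) least-squares regression on a tiny synthetic MDP. The function names `fitted_q_evaluation` and `run_demo`, the uniform behavior policy, the uniform initial-state distribution, and all sizes are hypothetical choices made for illustration only.

```python
import numpy as np


def fitted_q_evaluation(S, A, R, S_next, pi, n_states, n_actions, gamma, n_iters=50):
    """FQE sketch: repeatedly regress r + gamma * E_{a'~pi}[Q(s', a')] onto (s, a).

    Assumption: one-hot features stand in for the neural network regressor.
    """
    n = len(S)
    Phi = np.zeros((n, n_states * n_actions))
    Phi[np.arange(n), S * n_actions + A] = 1.0  # one-hot feature per (s, a) pair
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        # Expected next-state value under the target policy pi
        v_next = (pi[S_next] * Q[S_next]).sum(axis=1)
        y = R + gamma * v_next  # regression targets
        w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
        Q = w.reshape(n_states, n_actions)
    return Q


def run_demo(seed=0, n_states=3, n_actions=2, n_samples=20000, gamma=0.5):
    """Compare the FQE estimate with the exact value on a synthetic MDP (hypothetical setup)."""
    rng = np.random.default_rng(seed)
    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
    r = rng.uniform(size=(n_states, n_actions))                       # deterministic rewards
    pi = rng.dirichlet(np.ones(n_actions), size=n_states)             # target policy pi[s, a]

    # Off-policy data: states and actions drawn uniformly (uniform behavior policy)
    S = rng.integers(n_states, size=n_samples)
    A = rng.integers(n_actions, size=n_samples)
    cdf = P[S, A].cumsum(axis=1)
    u = rng.uniform(size=n_samples)
    S_next = np.minimum((u[:, None] > cdf).sum(axis=1), n_states - 1)
    R = r[S, A]

    Q_hat = fitted_q_evaluation(S, A, R, S_next, pi, n_states, n_actions, gamma)

    # Exact Q for the target policy: solve (I - gamma * P_pi) q = r
    d = n_states * n_actions
    P_pi = np.einsum("sax,xb->saxb", P, pi).reshape(d, d)
    Q_true = np.linalg.solve(np.eye(d) - gamma * P_pi, r.reshape(-1))
    Q_true = Q_true.reshape(n_states, n_actions)

    # Policy value under a uniform initial-state distribution
    v_hat = (pi * Q_hat).sum(axis=1).mean()
    v_true = (pi * Q_true).sum(axis=1).mean()
    return v_hat, v_true


if __name__ == "__main__":
    v_hat, v_true = run_demo()
    print(f"FQE estimate: {v_hat:.4f}, exact value: {v_true:.4f}")
```

In this toy setting the behavior and target policies differ, yet the estimate tracks the exact value because the uniform behavior data covers every state-action pair. The paper's analysis concerns the much harder case where the regressor is a convolutional network and coverage is measured by the restricted $\chi^2$-divergence.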

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
