ML · LG · Mar 30

Functional Natural Policy Gradients

Aurelien Bibaut, Houssam Zenati, Thibaud Rahier, Nathan Kallus

arXiv:2603.2868167.61 citations · h-index: 11

AI Analysis

Aimed at researchers in offline reinforcement learning, this paper offers a method for learning over high-complexity policy classes, with an explicit trade-off between policy-class complexity and the complexity of the environment dynamics.

The paper tackles policy learning from offline data by proposing a cross-fitted debiasing device. It obtains $\sqrt N$ regret even for policy classes with complexity greater than Donsker, provided a product-of-errors nuisance remainder is $O(N^{-1/2})$.
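To give a feel for what cross-fitting and debiasing mean in practice, here is a minimal sketch of a generic cross-fitted doubly robust value estimate. It is not the paper's device: it assumes a much simpler contextual-bandit setting with binary actions, known logging propensities, and a linear reward model. The function name `cross_fitted_dr_value` and all of its parameters are hypothetical.

```python
import numpy as np

def cross_fitted_dr_value(X, A, R, target_policy, propensity, n_folds=2, seed=0):
    """Cross-fitted doubly robust estimate of a target policy's value.

    Illustrative assumptions (not taken from the paper):
      X: (n, d) contexts; A: (n,) integer actions in {0, 1}; R: (n,) rewards.
      target_policy(X) -> (m,) probability of action 1 under the target policy.
      propensity: (n,) known logging probability of the action actually taken.
    """
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    folds = rng.permutation(n) % n_folds        # balanced random fold labels
    Xb = np.column_stack([np.ones(n), X])       # add an intercept column
    scores = np.empty(n)
    for k in range(n_folds):
        train, test = folds != k, folds == k
        m = int(test.sum())
        q_hat = np.zeros((m, 2))
        for a in (0, 1):
            mask = train & (A == a)
            # Nuisance fit: per-action linear reward model, out-of-fold data only.
            beta, *_ = np.linalg.lstsq(Xb[mask], R[mask], rcond=None)
            q_hat[:, a] = Xb[test] @ beta
        p1 = target_policy(X[test])
        pi_probs = np.column_stack([1.0 - p1, p1])
        idx = np.arange(m)
        a_test = A[test]
        plug_in = (pi_probs * q_hat).sum(axis=1)
        # Debiasing term: importance-weighted residual of the reward model.
        weights = pi_probs[idx, a_test] / propensity[test]
        scores[test] = plug_in + weights * (R[test] - q_hat[idx, a_test])
    return scores.mean()
```

Each observation is scored with a reward model fitted only on the other folds, so the nuisance estimate is independent of the data it is evaluated on. That independence is the general motivation for cross-fitting.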

We propose a cross-fitted debiasing device for policy learning from offline data. A key consequence of the resulting learning principle is $\sqrt N$ regret even for policy classes with complexity greater than Donsker, provided a product-of-errors nuisance remainder is $O(N^{-1/2})$. The regret bound factors into a plug-in policy error factor governed by policy-class complexity and an environment nuisance factor governed by the complexity of the environment dynamics, making explicit how one may be traded against the other.
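The trade-off in the last sentence can be illustrated with a schematic rate calculation. The symbols $\varepsilon_{\pi}(N)$, $\varepsilon_{\mathrm{env}}(N)$, $\alpha$, and $\beta$ are notation assumed here for illustration and do not come from the abstract; the sketch only spells out what a product-of-errors condition allows.

```latex
\[
  \underbrace{\varepsilon_{\pi}(N)}_{\text{plug-in policy error}}
  \times
  \underbrace{\varepsilon_{\mathrm{env}}(N)}_{\text{environment nuisance error}}
  = O\!\left(N^{-1/2}\right)
  \quad\text{holds whenever}\quad
  \varepsilon_{\pi}(N) = O\!\left(N^{-\alpha}\right),\;
  \varepsilon_{\mathrm{env}}(N) = O\!\left(N^{-\beta}\right),\;
  \alpha + \beta \ge \tfrac{1}{2}.
\]
```

Under this reading, a slower rate for one factor (say $\alpha = 1/8$ for a very rich policy class) can be offset by a faster rate for the other ($\beta \ge 3/8$ for simpler environment dynamics), and vice versa.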
