LG · Jun 15, 2023

Low-Switching Policy Gradient with Exploration via Online Sensitivity Sampling

arXiv:2306.09554v1 · 6 citations · h-index: 69
Originality Highly original
AI Analysis

This work addresses a key bottleneck in reinforcement learning: it provides a more sample-efficient policy optimization algorithm for researchers and practitioners working with complex, non-linear policies.

The paper tackles the problem of slow convergence and poor sample complexity in policy optimization for reinforcement learning with non-linear function approximation. Its algorithm obtains an ε-optimal policy with Õ(poly(d)/ε³) samples, improving on the previous best bound of Õ(poly(d)/ε⁸).

Policy optimization methods are powerful algorithms in Reinforcement Learning (RL) because of their flexibility in handling policy parameterization and their ability to handle model misspecification. However, these methods usually suffer from slow convergence rates and poor sample complexity. Hence it is important to design provably sample-efficient algorithms for policy optimization. Yet recent advances on this problem have only been successful in the tabular and linear settings, whose benign structures cannot be generalized to non-linearly parameterized policies. In this paper, we address this problem by leveraging recent advances in value-based algorithms, including bounded eluder dimension and online sensitivity sampling, to design a low-switching, sample-efficient policy optimization algorithm, LPO, with general non-linear function approximation. We show that our algorithm obtains an $\varepsilon$-optimal policy with only $\widetilde{O}(\frac{\text{poly}(d)}{\varepsilon^3})$ samples, where $\varepsilon$ is the suboptimality gap and $d$ is a complexity measure of the function class approximating the policy. This drastically improves on the previously best-known sample bound for policy optimization algorithms, $\widetilde{O}(\frac{\text{poly}(d)}{\varepsilon^8})$. Moreover, we empirically test our theory with deep neural nets to show the benefits of the theoretical inspiration.
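
To get a feel for the gap between the two rates quoted in the abstract, the short sketch below compares only their dependence on $\varepsilon$. It is an illustration under simplifying assumptions, not a computation from the paper: it drops the poly(d) terms (which may differ between the two bounds), the logarithmic factors hidden by the tilde, and all constants. The function name `sample_ratio` and its parameters are hypothetical.

```python
def sample_ratio(eps: float, old_exp: int = 8, new_exp: int = 3) -> float:
    """Return (1/eps**old_exp) / (1/eps**new_exp) = eps**-(old_exp - new_exp).

    Assumption: only the eps-dependence of the two bounds is compared;
    poly(d), log factors, and constants are ignored.
    """
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie in (0, 1)")
    return eps ** -(old_exp - new_exp)


if __name__ == "__main__":
    # Print how much larger the eps^-8 term is than the eps^-3 term.
    for eps in (0.5, 0.1, 0.01):
        print(f"eps={eps}: eps^-8 / eps^-3 = {sample_ratio(eps):.3g}")
```

Under these assumptions the ratio is $\varepsilon^{-5}$, so at $\varepsilon = 0.1$ the $\varepsilon^{-8}$ term is larger than the $\varepsilon^{-3}$ term by a factor of $10^5$.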
