Sampling Complexity of TD and PPO in RKHS
This work provides a firmer theoretical foundation for PPO beyond finite-dimensional assumptions, addressing the problem of understanding and improving reinforcement learning algorithms for researchers and practitioners, though it is incremental in extending existing methods to RKHS settings.
The paper tackled the theoretical analysis of Proximal Policy Optimization (PPO) by decoupling policy evaluation and improvement in a reproducing kernel Hilbert space (RKHS), providing non-asymptotic guarantees and deriving a sampling rule for optimal convergence rates. The result showed improved stability and sample efficiency on control tasks like CartPole and Acrobot, with the TD-based critic achieving favorable throughput compared to a baseline.
We revisit Proximal Policy Optimization (PPO) from a function-space perspective. Our analysis decouples policy evaluation and improvement in a reproducing kernel Hilbert space (RKHS): (i) A kernelized temporal-difference (TD) critic performs efficient RKHS-gradient updates using only one-step state-action transition samples; (ii) a KL-regularized, natural-gradient policy step exponentiates the evaluated action-value, recovering a PPO/TRPO-style proximal update in continuous state-action spaces. We provide non-asymptotic, instance-adaptive guarantees whose rates depend on RKHS entropy, unifying tabular, linear, Sobolev, Gaussian, and Neural Tangent Kernel (NTK) regimes, and we derive a sampling rule for the proximal update that ensures the optimal $k^{-1/2}$ convergence rate for stochastic optimization. Empirically, the theory-aligned schedule improves stability and sample efficiency on common control tasks (e.g., CartPole, Acrobot), while our TD-based critic attains favorable throughput versus a GAE baseline. Altogether, our results place PPO on a firmer theoretical footing beyond finite-dimensional assumptions and clarify when RKHS-proximal updates with kernel-TD critics yield global policy improvement with practical efficiency.