Isoperimetry is All We Need: Langevin Posterior Sampling for RL with Sublinear Regret
This work addresses the challenge of extending RL theory to more general distributions beyond common assumptions like linear models or log-concave posteriors, which is incremental as it builds on existing posterior sampling methods but applies them to a broader class of distributions.
The paper tackles the problem of designing reinforcement learning algorithms with sublinear regret for isoperimetric distributions, specifically those satisfying the Log-Sobolev Inequality, which includes non-log-concave and perturbed distributions beyond standard assumptions. It shows that Posterior Sampling-based RL achieves sublinear regret under these conditions, and proposes LaPSRL, a Langevin sampling-based algorithm that achieves order-optimal regret and subquadratic complexity per episode, validated experimentally across bandit and MDP environments with competitive performance.
Common assumptions, like linear or RKHS models, and Gaussian or log-concave posteriors over the models, do not explain practical success of RL across a wider range of distributions and models. Thus, we study how to design RL algorithms with sublinear regret for isoperimetric distributions, specifically the ones satisfying the Log-Sobolev Inequality (LSI). LSI distributions include the standard setups of RL theory, and others, such as many non-log-concave and perturbed distributions. First, we show that the Posterior Sampling-based RL (PSRL) algorithm yields sublinear regret if the data distributions satisfy LSI and some mild additional assumptions. Also, when we cannot compute or sample from an exact posterior, we propose a Langevin sampling-based algorithm design: LaPSRL. We show that LaPSRL achieves order-optimal regret and subquadratic complexity per episode. Finally, we deploy LaPSRL with a Langevin sampler -- SARAH-LD, and test it for different bandit and MDP environments. Experimental results validate the generality of LaPSRL across environments and its competitive performance with respect to the baselines.