On The Sample Complexity Bounds In Bilevel Reinforcement Learning
This provides foundational theoretical guarantees for aligning generative models in reinforcement learning, though it is incremental in extending existing optimization techniques to this specific domain.
The paper tackles the lack of theoretical sample complexity bounds in bilevel reinforcement learning by establishing the first bound of O(ε^{-3}) in continuous spaces, improving upon prior O(ε^{-6}) bounds and addressing computational bottlenecks with a Hessian-free algorithm.
Bilevel reinforcement learning (BRL) has emerged as a powerful framework for aligning generative models, yet its theoretical foundations, especially sample complexity bounds, remain underexplored. In this work, we present the first sample complexity bound for BRL, establishing a rate of $\mathcal{O}(ε^{-3})$ in continuous state-action spaces. Traditional MDP analysis techniques do not extend to BRL due to its nested structure and non-convex lower-level problems. We overcome these challenges by leveraging the Polyak-Łojasiewicz (PL) condition and the MDP structure to obtain closed-form gradients, enabling tight sample complexity analysis. Our analysis also extends to general bi-level optimization settings with non-convex lower levels, where we achieve state-of-the-art sample complexity results of $\mathcal{O}(ε^{-3})$ improving upon existing bounds of $\mathcal{O}(ε^{-6})$. Additionally, we address the computational bottleneck of hypergradient estimation by proposing a fully first-order, Hessian-free algorithm suitable for large-scale problems.