LGOCMLMay 30, 2019

Finite-time Analysis of Approximate Policy Iteration for the Linear Quadratic Regulator

arXiv:1905.12842v170 citations
Originality Incremental advance
AI Analysis

This work provides theoretical insights into reinforcement learning algorithms for continuous control, with incremental improvements in sample complexity and regret bounds.

The paper analyzes the sample complexity of approximate policy iteration for the Linear Quadratic Regulator, showing that policy evaluation requires at most (n+d)^3/ε^2 samples per step and only log(1/ε) improvement steps are needed, resulting in an overall complexity of (n+d)^3 ε^{-2} log(1/ε).

We study the sample complexity of approximate policy iteration (PI) for the Linear Quadratic Regulator (LQR), building on a recent line of work using LQR as a testbed to understand the limits of reinforcement learning (RL) algorithms on continuous control tasks. Our analysis quantifies the tension between policy improvement and policy evaluation, and suggests that policy evaluation is the dominant factor in terms of sample complexity. Specifically, we show that to obtain a controller that is within $\varepsilon$ of the optimal LQR controller, each step of policy evaluation requires at most $(n+d)^3/\varepsilon^2$ samples, where $n$ is the dimension of the state vector and $d$ is the dimension of the input vector. On the other hand, only $\log(1/\varepsilon)$ policy improvement steps suffice, resulting in an overall sample complexity of $(n+d)^3 \varepsilon^{-2} \log(1/\varepsilon)$. We furthermore build on our analysis and construct a simple adaptive procedure based on $\varepsilon$-greedy exploration which relies on approximate PI as a sub-routine and obtains $T^{2/3}$ regret, improving upon a recent result of Abbasi-Yadkori et al.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes