LG AIDec 1, 2022

Near Sample-Optimal Reduction-based Policy Learning for Average Reward MDP

arXiv:2212.00603v118.133 citationsh-index: 36

Originality Incremental advance

AI Analysis

This work addresses the efficiency of reinforcement learning algorithms for AMDPs, providing near-optimal sample bounds that are significant for researchers and practitioners in AI and control theory, though it is incremental as it builds on existing reduction techniques.

The paper tackles the sample complexity of finding an ε-optimal policy in average reward Markov Decision Processes (AMDPs) with a generative model, proving an upper bound of Õ(H ε^{-3} ln(1/δ)) samples per state-action pair and a matching lower bound up to logarithmic factors, improving upon prior mixing-time-based approaches.

This work considers the sample complexity of obtaining an $\varepsilon$-optimal policy in an average reward Markov Decision Process (AMDP), given access to a generative model (simulator). When the ground-truth MDP is weakly communicating, we prove an upper bound of $\widetilde O(H \varepsilon^{-3} \ln \frac{1}δ)$ samples per state-action pair, where $H := sp(h^*)$ is the span of bias of any optimal policy, $\varepsilon$ is the accuracy and $δ$ is the failure probability. This bound improves the best-known mixing-time-based approaches in [Jin & Sidford 2021], which assume the mixing-time of every deterministic policy is bounded. The core of our analysis is a proper reduction bound from AMDP problems to discounted MDP (DMDP) problems, which may be of independent interests since it allows the application of DMDP algorithms for AMDP in other settings. We complement our upper bound by proving a minimax lower bound of $Ω(|\mathcal S| |\mathcal A| H \varepsilon^{-2} \ln \frac{1}δ)$ total samples, showing that a linear dependent on $H$ is necessary and that our upper bound matches the lower bound in all parameters of $(|\mathcal S|, |\mathcal A|, H, \ln \frac{1}δ)$ up to some logarithmic factors.

View on arXiv PDF

Similar