AsyncQVI: Asynchronous-Parallel Q-Value Iteration for Discounted Markov Decision Processes with Near-Optimal Sample Complexity
This provides a scalable solution for large-scale reinforcement learning applications by reducing memory usage and enabling parallel processing, though it is incremental as it builds on existing value iteration methods.
The paper tackles the problem of solving discounted Markov decision processes with a generative model by proposing AsyncQVI, an asynchronous-parallel Q-value iteration algorithm that achieves near-optimal sample complexity, using memory of size O(|S|) and returning an ε-optimal policy with probability at least 1-δ using Õ(|S||A|/((1-γ)^5ε^2) log(1/δ)) samples.
In this paper, we propose AsyncQVI, an asynchronous-parallel Q-value iteration for discounted Markov decision processes whose transition and reward can only be sampled through a generative model. Given such a problem with $|\mathcal{S}|$ states, $|\mathcal{A}|$ actions, and a discounted factor $γ\in(0,1)$, AsyncQVI uses memory of size $\mathcal{O}(|\mathcal{S}|)$ and returns an $\varepsilon$-optimal policy with probability at least $1-δ$ using $$\tilde{\mathcal{O}}\big(\frac{|\mathcal{S}||\mathcal{A}|}{(1-γ)^5\varepsilon^2}\log(\frac{1}δ)\big)$$ samples. AsyncQVI is also the first asynchronous-parallel algorithm for discounted Markov decision processes that has a sample complexity, which nearly matches the theoretical lower bound. The relatively low memory footprint and parallel ability make AsyncQVI suitable for large-scale applications. In numerical tests, we compare AsyncQVI with four sample-based value iteration methods. The results show that our algorithm is highly efficient and achieves linear parallel speedup.