OC LG SYSep 17, 2018

Optimal Matrix Momentum Stochastic Approximation and Applications to Q-learning

arXiv:1809.06277v211.16 citations

Originality Incremental advance

AI Analysis

This work addresses the challenge of balancing computational efficiency and performance in stochastic optimization for reinforcement learning, offering a novel method that improves upon existing techniques like Ruppert-Polyak averaging and stochastic Newton-Raphson, though it is incremental in the context of acceleration methods.

The paper tackles the problem of achieving optimal asymptotic variance in stochastic approximation algorithms, which is important for reinforcement learning applications like Q-learning, by introducing PolSA and NeSA, two new root-finding algorithms with matrix momentum, and demonstrates that PolSA's parameter estimates couple with those of the ideal but complex stochastic Newton-Raphson algorithm, providing a third approach to optimal asymptotic covariance.

Acceleration is an increasingly common theme in the stochastic optimization literature. The two most common examples are Nesterov's method, and Polyak's momentum technique. In this paper two new algorithms are introduced for root finding problems: 1) PolSA is a root finding algorithm with specially designed matrix momentum, and 2) NeSA can be regarded as a variant of Nesterov's algorithm, or a simplification of PolSA. The PolSA algorithm is new even in the context of optimization (when cast as a root finding problem). The research surveyed in this paper is motivated by applications to reinforcement learning. It is well known that most variants of TD- and Q-learning may be cast as SA (stochastic approximation) algorithms, and the tools from general SA theory can be used to investigate convergence and bounds on convergence rate. In particular, the asymptotic variance is a common metric of performance for SA algorithms, and is also one among many metrics used in assessing the performance of stochastic optimization algorithms. There are two well known SA techniques that are known to have optimal asymptotic variance: the Ruppert-Polyak averaging technique, and stochastic Newton-Raphson (SNR). The former algorithm can have extremely bad transient performance, and the latter can be computationally expensive. It is demonstrated here that parameter estimates from the new PolSA algorithm couple with those of the ideal (but more complex) SNR algorithm. The new algorithm is thus a third approach to obtain optimal asymptotic covariance. These strong results require assumptions on the model. A linearized model is considered, and the noise is assumed to be a martingale difference sequence. Numerical results are obtained in a non-linear setting that is the motivation for this work: In PolSA implementations of Q-learning it is observed that coupling occurs with SNR in this non-ideal setting.

View on arXiv PDF

Similar