Regularized Off-Policy TD-Learning
This work addresses computational efficiency and feature selection in reinforcement learning, both of which matter to practitioners, though it appears incremental because it builds on existing gradient TD methods.
The paper tackles the problem of learning sparse value function representations in off-policy reinforcement learning by proposing RO-TD, a novel $l_1$-regularized TD-learning method; its experiments demonstrate off-policy convergence and low computational cost.
We present a novel $l_1$-regularized, off-policy convergent TD-learning method (termed RO-TD), which is able to learn sparse representations of value functions with low computational complexity. The algorithmic framework underlying RO-TD integrates two key ideas: off-policy convergent gradient TD methods, such as TDC, and a convex-concave saddle-point formulation of non-smooth convex optimization, which enables first-order solvers and feature selection using online convex regularization. We present a detailed theoretical and experimental analysis of RO-TD; the experiments illustrate the off-policy convergence, sparse feature selection capability, and low computational cost of the algorithm.
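As a rough illustration of how a gradient TD update can be combined with $l_1$ regularization, the sketch below applies one importance-weighted TDC step and then soft-thresholds the value weights. It is a plain proximal-gradient shortcut, not the saddle-point solver that RO-TD relies on. The function names, the step sizes `alpha` and `beta`, and the regularization weight `lam` are assumptions made for this example.

```python
def dot(u, v):
    """Inner product of two equal-length lists of floats."""
    return sum(a * b for a, b in zip(u, v))


def soft_threshold(x, tau):
    """Proximal operator of tau * |x|: shrink x toward zero by tau."""
    if x > tau:
        return x - tau
    if x < -tau:
        return x + tau
    return 0.0


def tdc_l1_step(theta, w, phi, phi_next, reward, rho=1.0,
                gamma=0.9, alpha=0.05, beta=0.1, lam=0.01):
    """One TDC update followed by soft-thresholding of theta.

    theta    -- value-function weights (the vector we want to be sparse)
    w        -- auxiliary weights used by TDC's gradient correction
    phi      -- feature vector of the current state
    phi_next -- feature vector of the next state
    rho      -- importance-sampling ratio (1.0 means on-policy data)
    lam      -- hypothetical l1 regularization weight
    """
    # TD error under the current value estimate.
    delta = reward + gamma * dot(theta, phi_next) - dot(theta, phi)
    phi_w = dot(phi, w)

    # TDC main update: the TD step plus a gradient-correction term.
    theta = [t + alpha * rho * (delta * p - gamma * phi_w * pn)
             for t, p, pn in zip(theta, phi, phi_next)]

    # Auxiliary update: w tracks the expected TD error given the features.
    w = [wi + beta * rho * (delta - phi_w) * p for wi, p in zip(w, phi)]

    # l1 proximal step: small weights are set exactly to zero.
    theta = [soft_threshold(t, alpha * lam) for t in theta]
    return theta, w
```

Calling `tdc_l1_step` once per observed transition, with `rho` set to the ratio of target-policy to behavior-policy action probabilities, gives a simple sparse off-policy learner under these assumptions; weights whose magnitude stays below `alpha * lam` after an update are zeroed, which is where the feature selection effect comes from.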