An Adiabatic Theorem for Policy Tracking with TD-learning
This work addresses the challenge of adapting reinforcement learning algorithms to dynamic environments, though the contribution appears incremental: it builds on existing methods and adds new theoretical bounds.
The paper tackles the problem of tracking the reward function of a changing policy with temporal difference learning, and derives finite-time bounds for tabular TD-learning and Q-learning under time-varying policies.
We evaluate the ability of temporal difference learning to track the reward function of a policy as it changes over time. Our results apply a new adiabatic theorem that bounds the mixing time of time-inhomogeneous Markov chains. We derive finite-time bounds for tabular temporal difference learning and $Q$-learning when the policy used for training changes over time. To achieve this, we develop bounds for stochastic approximation under asynchronous adiabatic updates.
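To make the tracking setting concrete, the following is a minimal sketch of tabular TD(0) run while the behavior policy drifts slowly. It is an illustration only, not the paper's algorithm or its bounds. The two-state MDP (`P`, `R`), the linear drift schedule for `p0`, the step size `alpha`, and the function names are all assumptions introduced here. Only the visited state is updated at each step, which is the asynchronous aspect mentioned above.

```python
import random

# Hypothetical 2-state, 2-action MDP (an assumption, not taken from the paper).
# P[s][a][s2] is the transition probability, R[s][a] the reward.
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.7, 0.3], [0.1, 0.9]]]
R = [[1.0, 0.0],
     [0.0, 2.0]]


def true_value(p0, gamma, sweeps=2000):
    """Value of the fixed policy that picks action 0 with probability p0.

    Computed by repeated application of the Bellman operator; the error
    shrinks like gamma**sweeps.
    """
    pi = [p0, 1.0 - p0]
    V = [0.0, 0.0]
    for _ in range(sweeps):
        V = [
            sum(
                pi[a] * (R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in range(2)))
                for a in range(2)
            )
            for s in range(2)
        ]
    return V


def track_policy_with_td(num_steps=20000, alpha=0.05, gamma=0.9, seed=0):
    """Run tabular TD(0) along one trajectory while the policy drifts slowly.

    Returns the TD estimate and the exact value of the final policy.
    """
    rng = random.Random(seed)
    V = [0.0, 0.0]
    s = 0
    p0 = 0.9
    for t in range(num_steps):
        # Assumed drift schedule: p0 moves linearly from 0.9 toward 0.1,
        # so the policy changes by a small amount at every step.
        p0 = 0.9 - 0.8 * t / num_steps
        a = 0 if rng.random() < p0 else 1
        s_next = 0 if rng.random() < P[s][a][0] else 1
        # Asynchronous update: only the currently visited state changes.
        td_error = R[s][a] + gamma * V[s_next] - V[s]
        V[s] += alpha * td_error
        s = s_next
    return V, true_value(p0, gamma)


if __name__ == "__main__":
    estimate, target = track_policy_with_td()
    print("TD estimate:", estimate)
    print("value of final policy:", target)
```

Comparing the two printed vectors gives a rough sense of the tracking error. In this sketch the error comes from two sources: step-size noise and the lag caused by the drift. Slowing the drift or tuning `alpha` trades one against the other.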