Is Prior-Free Black-Box Non-Stationary Reinforcement Learning Feasible?
This work addresses the feasibility of prior-free black-box algorithms for non-stationary RL, revealing limitations in current approaches for researchers and practitioners.
The paper tackles the problem of non-stationary reinforcement learning without prior knowledge, showing that the state-of-the-art black-box algorithm MASTER performs similarly to random restarting in practice and has regret bounds that remain high until unrealistic horizons, while methods using quickest change detection outperform it in simulations.
We study the problem of Non-Stationary Reinforcement Learning (NS-RL) without prior knowledge about the system's non-stationarity. A state-of-the-art, black-box algorithm, known as MASTER, is considered, with a focus on identifying the conditions under which it can achieve its stated goals. Specifically, we prove that MASTER's non-stationarity detection mechanism is not triggered for practical choices of horizon, leading to performance akin to a random restarting algorithm. Moreover, we show that the regret bound for MASTER, while being order optimal, stays above the worst-case linear regret until unreasonably large values of the horizon. To validate these observations, MASTER is tested for the special case of piecewise stationary multi-armed bandits, along with methods that employ random restarting, and others that use quickest change detection to restart. A simple, order optimal random restarting algorithm, that has prior knowledge of the non-stationarity is proposed as a baseline. The behavior of the MASTER algorithm is validated in simulations, and it is shown that methods employing quickest change detection are more robust and consistently outperform MASTER and other random restarting approaches.