The Fallacy of Minimizing Cumulative Regret in the Sequential Task Setting
This work addresses a practical issue in RL for fields such as healthcare, where human-in-the-loop decisions between tasks introduce non-stationarity. The contribution is incremental, however, in that it extends prior results for stationary environments.
The paper studies how to balance cumulative regret (CR) and simple regret (SR) in sequential reinforcement learning tasks under non-stationarity. It shows that non-stationarity imposes a stricter trade-off between the two: balancing them requires excessive exploration, which yields a CR bound worse than the typical optimal rate of $T^{1/2}$.
Online Reinforcement Learning (RL) is typically framed as the process of minimizing cumulative regret (CR) through interactions with an unknown environment. However, real-world RL applications usually involve a sequence of tasks, and the data collected in the first task is used to warm-start the second task. The performance of the warm-start policy is measured by simple regret (SR). Although minimizing CR and minimizing SR are generally conflicting objectives, previous research has shown that in stationary environments, both can be optimized in terms of the task duration, $T$. In real-world applications, however, human-in-the-loop decisions between tasks often result in non-stationarity. For instance, in clinical trials, scientists may adjust target health outcomes between implementations. Our results show that task non-stationarity leads to a more restrictive trade-off between CR and SR. To balance these competing goals, the algorithm must explore excessively, leading to a CR bound worse than the typical optimal rate of $T^{1/2}$. These findings are practically significant, indicating that increased exploration is necessary in non-stationary environments to accommodate task changes, impacting the design of RL algorithms in fields such as healthcare and beyond.
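
To make the two regret notions concrete, the following is a minimal sketch in a much simpler setting than the one studied here: a single stationary Bernoulli bandit task run with an explore-then-commit rule. It is not the paper's algorithm or setting. The function name `explore_then_commit`, its parameters, and the use of pseudo-regret (computed from the true arm means rather than realized rewards) are all assumptions made for illustration.

```python
import random


def explore_then_commit(means, horizon, explore_rounds, seed=0):
    """Illustrative explore-then-commit on a stationary Bernoulli bandit.

    Returns (cumulative_regret, simple_regret), both as pseudo-regret:
      CR = sum over rounds of (best mean - mean of the arm pulled)
      SR = best mean - mean of the arm recommended after the task ends
    """
    rng = random.Random(seed)
    k = len(means)
    best = max(means)
    pulls = [0] * k      # number of times each arm was pulled
    totals = [0.0] * k   # sum of observed rewards per arm
    cumulative_regret = 0.0

    def pull(arm):
        nonlocal cumulative_regret
        reward = 1.0 if rng.random() < means[arm] else 0.0
        pulls[arm] += 1
        totals[arm] += reward
        cumulative_regret += best - means[arm]

    def empirical_best():
        # Unpulled arms are scored 0.0; ties go to the lowest index.
        return max(range(k),
                   key=lambda a: totals[a] / pulls[a] if pulls[a] else 0.0)

    # Exploration phase: pull arms round-robin, explore_rounds times each
    # (capped by the horizon).
    n_explore = min(horizon, explore_rounds * k)
    for t in range(n_explore):
        pull(t % k)

    # Commit phase: play the empirically best arm for the remaining rounds.
    committed = empirical_best()
    for _ in range(horizon - n_explore):
        pull(committed)

    # Warm-start recommendation for a hypothetical next task.
    recommended = empirical_best()
    simple_regret = best - means[recommended]
    return cumulative_regret, simple_regret


if __name__ == "__main__":
    for rounds in (1, 5, 25):
        cr, sr = explore_then_commit([0.4, 0.6], horizon=200,
                                     explore_rounds=rounds, seed=1)
        print(f"explore_rounds={rounds}: CR={cr:.2f}, SR={sr:.2f}")
```

In this sketch, every exploratory pull of a suboptimal arm adds to CR, while more exploration makes it more likely that the recommended arm is the best one, so that SR is zero. The sketch covers only the stationary single-task case and does not model the task changes analyzed in this work.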