Minimax Regret for Stochastic Shortest Path with Adversarial Costs and Known Transition
This work addresses a fundamental problem in online learning and reinforcement learning, relevant to both researchers and practitioners, through a series of incremental algorithmic improvements.
The paper tackles the stochastic shortest path problem with adversarial costs and known transition, achieving minimax regret bounds of $\widetilde{O}(\sqrt{DT^\star K})$ with full-information feedback and $\widetilde{O}(\sqrt{DT^\star SA K})$ with bandit feedback. These bounds significantly improve upon prior work, and the paper is the first to handle bandit feedback in this setting.
We study the stochastic shortest path problem with adversarial costs and known transition, and show that the minimax regret is $\widetilde{O}(\sqrt{DT^\star K})$ and $\widetilde{O}(\sqrt{DT^\star SA K})$ for the full-information setting and the bandit feedback setting respectively, where $D$ is the diameter, $T^\star$ is the expected hitting time of the optimal policy, $S$ is the number of states, $A$ is the number of actions, and $K$ is the number of episodes. Our results significantly improve upon the existing work of Rosenberg and Mansour (2020), which only considers the full-information setting and achieves suboptimal regret. Our work is also the first to consider bandit feedback with adversarial costs. Our algorithms are built on top of the Online Mirror Descent framework with a variety of new techniques that might be of independent interest, including an improved multi-scale expert algorithm, a reduction from general stochastic shortest path to a special loop-free case, a skewed occupancy measure space, and a novel correction term added to the cost estimators. Interestingly, the last two elements reduce the variance of the learner via positive bias and the variance of the optimal policy via negative bias, respectively, and having them simultaneously is critical for obtaining the optimal high-probability bound in the bandit feedback setting.
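For readers unfamiliar with Online Mirror Descent, the following is a minimal sketch of the generic framework only, not the paper's algorithm. It assumes the simplest possible setting: a decision set equal to the probability simplex over a few actions, full-information costs, and the negative-entropy regularizer, under which the update reduces to exponential weights. The paper instead runs the framework over a (skewed) occupancy measure space with additional correction terms, none of which is modeled here. The function names `omd_entropy_step` and `run_omd` and the step size `eta` are illustrative assumptions.

```python
import math


def omd_entropy_step(weights, costs, eta):
    """One Online Mirror Descent step on the probability simplex.

    Assumption: negative-entropy regularizer, so the mirror step is a
    multiplicative update and the projection is a renormalization.
    """
    # Multiplicative update: w_i <- w_i * exp(-eta * c_i)
    unnormalized = [w * math.exp(-eta * c) for w, c in zip(weights, costs)]
    total = sum(unnormalized)
    # Renormalizing is the KL projection back onto the simplex.
    return [u / total for u in unnormalized]


def run_omd(cost_sequence, eta):
    """Run the update over a sequence of full-information cost vectors.

    Returns the final weights and the learner's cumulative expected cost.
    """
    n = len(cost_sequence[0])
    weights = [1.0 / n] * n  # start from the uniform distribution
    learner_cost = 0.0
    for costs in cost_sequence:
        # The learner pays the expected cost of its current distribution,
        # then observes the whole cost vector (full information).
        learner_cost += sum(w * c for w, c in zip(weights, costs))
        weights = omd_entropy_step(weights, costs, eta)
    return weights, learner_cost


if __name__ == "__main__":
    # Hypothetical toy instance: action 0 always costs 1, action 1 costs 0.
    toy_costs = [[1.0, 0.0]] * 20
    final_weights, total_cost = run_omd(toy_costs, eta=0.5)
    print(final_weights, total_cost)
```

In this toy instance the best fixed action has cumulative cost zero, so the learner's cumulative cost equals its regret; the weights concentrate on the cheaper action as rounds accumulate.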