LGAIFeb 10, 2021

Finding the Stochastic Shortest Path with Low Regret: The Adversarial Cost and Unknown Transition Case

arXiv:2102.05284v232 citations
Originality Highly original
AI Analysis

This work solves a challenging reinforcement learning problem with adversarial costs and unknown transitions, offering algorithms with improved regret bounds for researchers and practitioners in online learning and control.

The paper tackles the stochastic shortest path problem with adversarial costs and unknown transitions, achieving regret bounds of $\widetilde{O}(\sqrt{S^2ADT_\star K})$ for full-information and $\widetilde{O}(\sqrt{S^3A^2DT_\star K})$ for bandit feedback settings, improving prior work and addressing a previously unsolved combination.

We make significant progress toward the stochastic shortest path problem with adversarial costs and unknown transition. Specifically, we develop algorithms that achieve $\widetilde{O}(\sqrt{S^2ADT_\star K})$ regret for the full-information setting and $\widetilde{O}(\sqrt{S^3A^2DT_\star K})$ regret for the bandit feedback setting, where $D$ is the diameter, $T_\star$ is the expected hitting time of the optimal policy, $S$ is the number of states, $A$ is the number of actions, and $K$ is the number of episodes. Our work strictly improves (Rosenberg and Mansour, 2020) in the full information setting, extends (Chen et al., 2020) from known transition to unknown transition, and is also the first to consider the most challenging combination: bandit feedback with adversarial costs and unknown transition. To remedy the gap between our upper bounds and the current best lower bounds constructed via a stochastically oblivious adversary, we also propose algorithms with near-optimal regret for this special case.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes