LGNov 11, 2024

Beating Adversarial Low-Rank MDPs with Unknown Transition and Bandit Feedback

Haolin Liu, Zakaria Mhammedi, Chen-Yu Wei, Julian Zimmert

arXiv:2411.06739v16.43 citationsh-index: 15NIPS

Originality Highly original

AI Analysis

This addresses the problem of efficient learning in complex environments for reinforcement learning researchers, with incremental improvements in regret bounds and new settings.

The paper tackles regret minimization in low-rank MDPs with adversarial losses, improving the regret bound from T^{5/6} to T^{2/3} for full-information feedback with unknown transitions, and introduces algorithms for bandit feedback with unknown transitions achieving T^{2/3} regret under linear loss structure, showing that without this structure, regret scales polynomially with states.

We consider regret minimization in low-rank MDPs with fixed transition and adversarial losses. Previous work has investigated this problem under either full-information loss feedback with unknown transitions (Zhao et al., 2024), or bandit loss feedback with known transition (Foster et al., 2022). First, we improve the $poly(d, A, H)T^{5/6}$ regret bound of Zhao et al. (2024) to $poly(d, A, H)T^{2/3}$ for the full-information unknown transition setting, where d is the rank of the transitions, A is the number of actions, H is the horizon length, and T is the number of episodes. Next, we initiate the study on the setting with bandit loss feedback and unknown transitions. Assuming that the loss has a linear structure, we propose both model based and model free algorithms achieving $poly(d, A, H)T^{2/3}$ regret, though they are computationally inefficient. We also propose oracle-efficient model-free algorithms with $poly(d, A, H)T^{4/5}$ regret. We show that the linear structure is necessary for the bandit case without structure on the reward function, the regret has to scale polynomially with the number of states. This is contrary to the full-information case (Zhao et al., 2024), where the regret can be independent of the number of states even for unstructured reward function.

View on arXiv PDF

Similar