Near-optimal Regret Using Policy Optimization in Online MDPs with Aggregate Bandit Feedback
This work addresses a key challenge in reinforcement learning for scenarios with limited feedback, such as robotics or resource management, by providing near-optimal algorithms for online MDPs with aggregate bandit feedback, representing a significant advance over prior methods.
The paper tackles the problem of online finite-horizon Markov Decision Processes with adversarially changing loss and aggregate bandit feedback, where only total trajectory loss is observed. It introduces the first Policy Optimization algorithms for this setting, achieving optimal regret bounds of θ̃(H^2√(SAK)) for known dynamics and Õ(H^3 S √(AK)) for unknown dynamics, improving prior results by a factor of H^2 S^5 A^2.
We study online finite-horizon Markov Decision Processes with adversarially changing loss and aggregate bandit feedback (a.k.a full-bandit). Under this type of feedback, the agent observes only the total loss incurred over the entire trajectory, rather than the individual losses at each intermediate step within the trajectory. We introduce the first Policy Optimization algorithms for this setting. In the known-dynamics case, we achieve the first \textit{optimal} regret bound of $\tilde Θ(H^2\sqrt{SAK})$, where $K$ is the number of episodes, $H$ is the episode horizon, $S$ is the number of states, and $A$ is the number of actions. In the unknown dynamics case we establish regret bound of $\tilde O(H^3 S \sqrt{AK})$, significantly improving the best known result by a factor of $H^2 S^5 A^2$.