Reinforcement Learning with Delayed, Composite, and Partially Anonymous Reward
This addresses a specific challenge in RL for scenarios with complex reward feedback, but it is incremental as it builds on existing MDP frameworks with new constraints.
The paper tackles the problem of reinforcement learning with delayed, composite, and partially anonymous rewards in an infinite-horizon MDP, proposing the DUCRL2 algorithm that achieves a near-optimal regret bound of ̃O(DS√AT + d(SA)^3), demonstrating optimality in the order of T and an additive impact of delay.
We investigate an infinite-horizon average reward Markov Decision Process (MDP) with delayed, composite, and partially anonymous reward feedback. The delay and compositeness of rewards mean that rewards generated as a result of taking an action at a given state are fragmented into different components, and they are sequentially realized at delayed time instances. The partial anonymity attribute implies that a learner, for each state, only observes the aggregate of past reward components generated as a result of different actions taken at that state, but realized at the observation instance. We propose an algorithm named $\mathrm{DUCRL2}$ to obtain a near-optimal policy for this setting and show that it achieves a regret bound of $\tilde{\mathcal{O}}\left(DS\sqrt{AT} + d (SA)^3\right)$ where $S$ and $A$ are the sizes of the state and action spaces, respectively, $D$ is the diameter of the MDP, $d$ is a parameter upper bounded by the maximum reward delay, and $T$ denotes the time horizon. This demonstrates the optimality of the bound in the order of $T$, and an additive impact of the delay.