LGFeb 1

The BoBW Algorithms for Heavy-Tailed MDPs

arXiv:2602.01295v1
Originality Highly original
AI Analysis

This work addresses the problem of handling heavy-tailed feedback in Markov Decision Processes, which is significant for researchers and practitioners dealing with uncertain or adversarial environments in machine learning and decision-making.

The authors tackled the problem of episodic Markov Decision Processes with heavy-tailed feedback, achieving Best-of-Both-Worlds guarantees with regret bounds of $widetilde{mathcal{O}}(T^{1/α})$ in adversarial regimes and $mathcal{O}(log T)$ or $mathcal{O}(log^2(T))$ in stochastic regimes. The proposed algorithms, HT-FTRL-OM and HT-FTRL-UOB, provide instance-independent and instance-dependent regret guarantees in different environments.

We investigate episodic Markov Decision Processes with heavy-tailed feedback (HTMDPs). Existing approaches for HTMDPs are conservative in stochastic environments and lack adaptivity in adversarial regimes. In this work, we propose algorithms ```HT-FTRL-OM``` and ```HT-FTRL-UOB``` for HTMDPs that achieve Best-of-Both-Worlds (BoBW) guarantees: instance-independent regret in adversarial environments and logarithmic instance-dependent regret in self-bounding (including the stochastic case) environments. For the known transition setting, ```HT-FTRL-OM``` applies the Follow-The-Regularized-Leader (FTRL) framework over occupancy measures with novel skipping loss estimators, achieving a $\widetilde{\mathcal{O}}(T^{1/α})$ regret bound in adversarial regimes and a $\mathcal{O}(\log T)$ regret in stochastic regimes. Building upon this framework, we develop a novel algorithm ```HT-FTRL-UOB``` to tackle the more challenging unknown-transition setting. This algorithm employs a pessimistic skipping loss estimator and achieves a $\widetilde{\mathcal{O}}(T^{1/α} + \sqrt{T})$ regret in adversarial regimes and a $\mathcal{O}(\log^2(T))$ regret in stochastic regimes. Our analysis overcomes key barriers through several technical insights, including a local control mechanism for heavy-tailed shifted losses, a new suboptimal-mass propagation principle, and a novel regret decomposition that isolates transition uncertainty from heavy-tailed estimation errors and skipping bias.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes