Jiatai Huang

LG
7papers
83citations
Novelty71%
AI Score44

7 Papers

LGMay 30, 2022
RLx2: Training a Sparse Deep Reinforcement Learning Model from Scratch

Yiqin Tan, Pihe Hu, Ling Pan et al.

Training deep reinforcement learning (DRL) models usually requires high computation costs. Therefore, compressing DRL models possesses immense potential for training acceleration and model deployment. However, existing methods that generate small models mainly adopt the knowledge distillation-based approach by iteratively training a dense network. As a result, the training process still demands massive computing resources. Indeed, sparse training from scratch in DRL has not been well explored and is particularly challenging due to non-stationarity in bootstrap training. In this work, we propose a novel sparse DRL training framework, "the Rigged Reinforcement Learning Lottery" (RLx2), which builds upon gradient-based topology evolution and is capable of training a sparse DRL model based entirely on a sparse network. Specifically, RLx2 introduces a novel multi-step TD target mechanism with a dynamic-capacity replay buffer to achieve robust value learning and efficient topology exploration in sparse models. It also reaches state-of-the-art sparse training performance in several tasks, showing 7.5\times-20\times model compression with less than 3% performance degradation and up to 20\times and 50\times FLOPs reduction for training and inference, respectively.

OCMar 3, 2023
Queue Scheduling with Adversarial Bandit Learning

Jiatai Huang, Leana Golubchik, Longbo Huang

In this paper, we study scheduling of a queueing system with zero knowledge of instantaneous network conditions. We consider a one-hop single-server queueing system consisting of $K$ queues, each with time-varying and non-stationary arrival and service rates. Our scheduling approach builds on an innovative combination of adversarial bandit learning and Lyapunov drift minimization, without knowledge of the instantaneous network state (the arrival and service rates) of each queue. We then present two novel algorithms \texttt{SoftMW} (SoftMaxWeight) and \texttt{SSMW} (Sliding-window SoftMaxWeight), both capable of stabilizing systems that can be stablized by some (possibly unknown) sequence of randomized policies whose time-variation satisfies a mild condition. We further generalize our results to the setting where arrivals and departures only have bounded moments instead of being deterministically bounded and propose \texttt{SoftMW+} and \texttt{SSMW+} that are capable of stabilizing the system. As a building block of our new algorithms, we also extend the classical \texttt{EXP3.S} (Auer et al., 2002) algorithm for multi-armed bandits to handle unboundedly large feedback signals, which can be of independent interest.

LGJan 25, 2023
Banker Online Mirror Descent: A Universal Approach for Delayed Online Bandit Learning

Jiatai Huang, Yan Dai, Longbo Huang · tsinghua

We propose Banker Online Mirror Descent (Banker-OMD), a novel framework generalizing the classical Online Mirror Descent (OMD) technique in the online learning literature. The Banker-OMD framework almost completely decouples feedback delay handling and the task-specific OMD algorithm design, thus facilitating the design of new algorithms capable of efficiently and robustly handling feedback delays. Specifically, it offers a general methodology for achieving $\widetilde{\mathcal O}(\sqrt{T} + \sqrt{D})$-style regret bounds in online bandit learning tasks with delayed feedback, where $T$ is the number of rounds and $D$ is the total feedback delay. We demonstrate the power of \texttt{Banker-OMD} by applications to two important bandit learning scenarios with delayed feedback, including delayed scale-free adversarial Multi-Armed Bandits (MAB) and delayed adversarial linear bandits. \texttt{Banker-OMD} leads to the first delayed scale-free adversarial MAB algorithm achieving $\widetilde{\mathcal O}(\sqrt{K}L(\sqrt T+\sqrt D))$ regret and the first delayed adversarial linear bandit algorithm achieving $\widetilde{\mathcal O}(\text{poly}(n)(\sqrt{T} + \sqrt{D}))$ regret. As a corollary, the first application also implies $\widetilde{\mathcal O}(\sqrt{KT}L)$ regret for non-delayed scale-free adversarial MABs, which is the first to match the $Ω(\sqrt{KT}L)$ lower bound up to logarithmic factors and can be of independent interest.

LGFeb 1
The BoBW Algorithms for Heavy-Tailed MDPs

Yu Chen, Yuhao Liu, Jiatai Huang et al.

We investigate episodic Markov Decision Processes with heavy-tailed feedback (HTMDPs). Existing approaches for HTMDPs are conservative in stochastic environments and lack adaptivity in adversarial regimes. In this work, we propose algorithms ```HT-FTRL-OM``` and ```HT-FTRL-UOB``` for HTMDPs that achieve Best-of-Both-Worlds (BoBW) guarantees: instance-independent regret in adversarial environments and logarithmic instance-dependent regret in self-bounding (including the stochastic case) environments. For the known transition setting, ```HT-FTRL-OM``` applies the Follow-The-Regularized-Leader (FTRL) framework over occupancy measures with novel skipping loss estimators, achieving a $\widetilde{\mathcal{O}}(T^{1/α})$ regret bound in adversarial regimes and a $\mathcal{O}(\log T)$ regret in stochastic regimes. Building upon this framework, we develop a novel algorithm ```HT-FTRL-UOB``` to tackle the more challenging unknown-transition setting. This algorithm employs a pessimistic skipping loss estimator and achieves a $\widetilde{\mathcal{O}}(T^{1/α} + \sqrt{T})$ regret in adversarial regimes and a $\mathcal{O}(\log^2(T))$ regret in stochastic regimes. Our analysis overcomes key barriers through several technical insights, including a local control mechanism for heavy-tailed shifted losses, a new suboptimal-mass propagation principle, and a novel regret decomposition that isolates transition uncertainty from heavy-tailed estimation errors and skipping bias.

LGJan 28, 2022
Adaptive Best-of-Both-Worlds Algorithm for Heavy-Tailed Multi-Armed Bandits

Jiatai Huang, Yan Dai, Longbo Huang

In this paper, we generalize the concept of heavy-tailed multi-armed bandits to adversarial environments, and develop robust best-of-both-worlds algorithms for heavy-tailed multi-armed bandits (MAB), where losses have $α$-th ($1<α\le 2$) moments bounded by $σ^α$, while the variances may not exist. Specifically, we design an algorithm \texttt{HTINF}, when the heavy-tail parameters $α$ and $σ$ are known to the agent, \texttt{HTINF} simultaneously achieves the optimal regret for both stochastic and adversarial environments, without knowing the actual environment type a-priori. When $α,σ$ are unknown, \texttt{HTINF} achieves a $\log T$-style instance-dependent regret in stochastic cases and $o(T)$ no-regret guarantee in adversarial cases. We further develop an algorithm \texttt{AdaTINF}, achieving $\mathcal O(σK^{1-\nicefrac 1α}T^{\nicefrac{1}α})$ minimax optimal regret even in adversarial settings, without prior knowledge on $α$ and $σ$. This result matches the known regret lower-bound (Bubeck et al., 2013), which assumed a stochastic environment and $α$ and $σ$ are both known. To our knowledge, the proposed \texttt{HTINF} algorithm is the first to enjoy a best-of-both-worlds regret guarantee, and \texttt{AdaTINF} is the first algorithm that can adapt to both $α$ and $σ$ to achieve optimal gap-indepedent regret bound in classical heavy-tailed stochastic MAB setting and our novel adversarial formulation.

LGOct 26, 2021
Scale-Free Adversarial Multi-Armed Bandit with Arbitrary Feedback Delays

Jiatai Huang, Yan Dai, Longbo Huang

We consider the Scale-Free Adversarial Multi-Armed Bandit (MAB) problem with unrestricted feedback delays. In contrast to the standard assumption that all losses are $[0,1]$-bounded, in our setting, losses can fall in a general bounded interval $[-L, L]$, unknown to the agent beforehand. Furthermore, the feedback of each arm pull can experience arbitrary delays. We propose a novel approach named Scale-Free Delayed INF (SFD-INF) for this novel setting, which combines a recent "convex combination trick" together with a novel doubling and skipping technique. We then present two instances of SFD-INF, each with carefully designed delay-adapted learning scales. The first one SFD-TINF uses $\frac 12$-Tsallis entropy regularizer and can achieve $\widetilde{\mathcal O}(\sqrt{K(D+T)}L)$ regret when the losses are non-negative, where $K$ is the number of actions, $T$ is the number of steps, and $D$ is the total feedback delay. This bound nearly matches the $Ω((\sqrt{KT}+\sqrt{D\log K})L)$ lower-bound when regarding $K$ as a constant independent of $T$. The second one, SFD-LBINF, works for general scale-free losses and achieves a small-loss style adaptive regret bound $\widetilde{\mathcal O}(\sqrt{K\mathbb{E}[\tilde{\mathfrak L}_T^2]}+\sqrt{KDL})$, which falls to the $\widetilde{\mathcal O}(\sqrt{K(D+T)}L)$ regret in the worst case and is thus more general than SFD-TINF despite a more complicated analysis and several extra logarithmic dependencies. Moreover, both instances also outperform the existing algorithms for non-delayed (i.e., $D=0$) scale-free adversarial MAB problems, which can be of independent interest.

LGJun 16, 2021
Banker Online Mirror Descent

Jiatai Huang, Longbo Huang

We propose Banker-OMD, a novel framework generalizing the classical Online Mirror Descent (OMD) technique in online learning algorithm design. Banker-OMD allows algorithms to robustly handle delayed feedback, and offers a general methodology for achieving $\tilde{O}(\sqrt{T} + \sqrt{D})$-style regret bounds in various delayed-feedback online learning tasks, where $T$ is the time horizon length and $D$ is the total feedback delay. We demonstrate the power of Banker-OMD with applications to three important bandit scenarios with delayed feedback, including delayed adversarial Multi-armed bandits (MAB), delayed adversarial linear bandits, and a novel delayed best-of-both-worlds MAB setting. Banker-OMD achieves nearly-optimal performance in all the three settings. In particular, it leads to the first delayed adversarial linear bandit algorithm achieving $\tilde{O}(\text{poly}(n)(\sqrt{T} + \sqrt{D}))$ regret.