LGFeb 12, 2023
Theory on Forgetting and Generalization of Continual LearningSen Lin, Peizhong Ju, Yingbin Liang et al.
Continual learning (CL), which aims to learn a sequence of tasks, has attracted significant recent attention. However, most work has focused on the experimental performance of CL, and theoretical studies of CL are still limited. In particular, there is a lack of understanding on what factors are important and how they affect "catastrophic forgetting" and generalization performance. To fill this gap, our theoretical analysis, under overparameterized linear models, provides the first-known explicit form of the expected forgetting and generalization error. Further analysis of such a key result yields a number of theoretical explanations about how overparameterization, task similarity, and task ordering affect both forgetting and generalization error of CL. More interestingly, by conducting experiments on real datasets using deep neural networks (DNNs), we show that some of these insights even go beyond the linear models and can be carried over to practical setups. In particular, we use concrete examples to show that our results not only explain some interesting empirical observations in recent studies, but also motivate better practical algorithm designs of CL.
LGJun 23, 2022
Provably Efficient Model-Free Constrained RL with Linear Function ApproximationArnob Ghosh, Xingyu Zhou, Ness Shroff
We study the constrained reinforcement learning problem, in which an agent aims to maximize the expected cumulative reward subject to a constraint on the expected total value of a utility function. In contrast to existing model-based approaches or model-free methods accompanied with a `simulator', we aim to develop the first model-free, simulator-free algorithm that achieves a sublinear regret and a sublinear constraint violation even in large-scale systems. To this end, we consider the episodic constrained Markov decision processes with linear function approximation, where the transition dynamics and the reward function can be represented as a linear function of some known feature mapping. We show that $\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ regret and $\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ constraint violation bounds can be achieved, where $d$ is the dimension of the feature mapping, $H$ is the length of the episode, and $T$ is the total number of steps. Our bounds are attained without explicitly estimating the unknown transition model or requiring a simulator, and they depend on the state space only through the dimension of the feature mapping. Hence our bounds hold even when the number of states goes to infinity. Our main results are achieved via novel adaptations of the standard LSVI-UCB algorithms. In particular, we first introduce primal-dual optimization into the LSVI-UCB algorithm to balance between regret and constraint violation. More importantly, we replace the standard greedy selection with respect to the state-action function in LSVI-UCB with a soft-max policy. This turns out to be key in establishing uniform concentration for the constrained case via its approximation-smoothness trade-off. We also show that one can achieve an even zero constraint violation while still maintaining the same order with respect to $T$.
LGMar 10, 2023
Provably Efficient Model-Free Algorithms for Non-stationary CMDPsHonghao Wei, Arnob Ghosh, Ness Shroff et al.
We study model-free reinforcement learning (RL) algorithms in episodic non-stationary constrained Markov Decision Processes (CMDPs), in which an agent aims to maximize the expected cumulative reward subject to a cumulative constraint on the expected utility (cost). In the non-stationary environment, reward, utility functions, and transition kernels can vary arbitrarily over time as long as the cumulative variations do not exceed certain variation budgets. We propose the first model-free, simulator-free RL algorithms with sublinear regret and zero constraint violation for non-stationary CMDPs in both tabular and linear function approximation settings with provable performance guarantees. Our results on regret bound and constraint violation for the tabular case match the corresponding best results for stationary CMDPs when the total budget is known. Additionally, we present a general framework for addressing the well-known challenges associated with analyzing non-stationary CMDPs, without requiring prior knowledge of the variation budget. We apply the approach for both tabular and linear approximation settings.
LGFeb 8, 2023
A Near-Optimal Algorithm for Safe Reinforcement Learning Under Instantaneous Hard ConstraintsMing Shi, Yingbin Liang, Ness Shroff
In many applications of Reinforcement Learning (RL), it is critically important that the algorithm performs safely, such that instantaneous hard constraints are satisfied at each step, and unsafe states and actions are avoided. However, existing algorithms for ''safe'' RL are often designed under constraints that either require expected cumulative costs to be bounded or assume all states are safe. Thus, such algorithms could violate instantaneous hard constraints and traverse unsafe states (and actions) in practice. Therefore, in this paper, we develop the first near-optimal safe RL algorithm for episodic Markov Decision Processes with unsafe states and actions under instantaneous hard constraints and the linear mixture model. It not only achieves a regret $\tilde{O}(\frac{d H^3 \sqrt{dK}}{Δ_c})$ that tightly matches the state-of-the-art regret in the setting with only unsafe actions and nearly matches that in the unconstrained setting, but is also safe at each step, where $d$ is the feature-mapping dimension, $K$ is the number of episodes, $H$ is the number of steps in each episode, and $Δ_c$ is a safety-related parameter. We also provide a lower bound $\tildeΩ(\max\{dH \sqrt{K}, \frac{H}{Δ_c^2}\})$, which indicates that the dependency on $Δ_c$ is necessary. Further, both our algorithm design and regret analysis involve several novel ideas, which may be of independent interest.
OCAug 7, 2023
Non-Convex Bilevel Optimization with Time-Varying Objective FunctionsSen Lin, Daouda Sow, Kaiyi Ji et al.
Bilevel optimization has become a powerful tool in a wide variety of machine learning problems. However, the current nonconvex bilevel optimization considers an offline dataset and static functions, which may not work well in emerging online applications with streaming data and time-varying functions. In this work, we study online bilevel optimization (OBO) where the functions can be time-varying and the agent continuously updates the decisions with online streaming data. To deal with the function variations and the unavailability of the true hypergradients in OBO, we propose a single-loop online bilevel optimizer with window averaging (SOBOW), which updates the outer-level decision based on a window average of the most recent hypergradient estimations stored in the memory. Compared to existing algorithms, SOBOW is computationally efficient and does not need to know previous functions. To handle the unique technical difficulties rooted in single-loop update and function variations for OBO, we develop a novel analytical technique that disentangles the complex couplings between decision variables, and carefully controls the hypergradient estimation error. We show that SOBOW can achieve a sublinear bilevel local regret under mild conditions. Extensive experiments across multiple domains corroborate the effectiveness of SOBOW.
LGFeb 8, 2023
Near-Optimal Adversarial Reinforcement Learning with Switching CostsMing Shi, Yingbin Liang, Ness Shroff
Switching costs, which capture the costs for changing policies, are regarded as a critical metric in reinforcement learning (RL), in addition to the standard metric of losses (or rewards). However, existing studies on switching costs (with a coefficient $β$ that is strictly positive and is independent of $T$) have mainly focused on static RL, where the loss distribution is assumed to be fixed during the learning process, and thus practical scenarios where the loss distribution could be non-stationary or even adversarial are not considered. While adversarial RL better models this type of practical scenarios, an open problem remains: how to develop a provably efficient algorithm for adversarial RL with switching costs? This paper makes the first effort towards solving this problem. First, we provide a regret lower-bound that shows that the regret of any algorithm must be larger than $\tildeΩ( ( H S A )^{1/3} T^{2/3} )$, where $T$, $S$, $A$ and $H$ are the number of episodes, states, actions and layers in each episode, respectively. Our lower bound indicates that, due to the fundamental challenge of switching costs in adversarial RL, the best achieved regret (whose dependency on $T$ is $\tilde{O}(\sqrt{T})$) in static RL with switching costs (as well as adversarial RL without switching costs) is no longer achievable. Moreover, we propose two novel switching-reduced algorithms with regrets that match our lower bound when the transition function is known, and match our lower bound within a small factor of $\tilde{O}( H^{1/3} )$ when the transition function is unknown. Our regret analysis demonstrates the near-optimal performance of them.
MLOct 4, 2023
Hoeffding's Inequality for Markov Chains under Generalized Concentrability ConditionHao Chen, Abhishek Gupta, Yin Sun et al.
This paper studies Hoeffding's inequality for Markov chains under the generalized concentrability condition defined via integral probability metric (IPM). The generalized concentrability condition establishes a framework that interpolates and extends the existing hypotheses of Markov chain Hoeffding-type inequalities. The flexibility of our framework allows Hoeffding's inequality to be applied beyond the ergodic Markov chains in the traditional sense. We demonstrate the utility by applying our framework to several non-asymptotic analyses arising from the field of machine learning, including (i) a generalization bound for empirical risk minimization with Markovian samples, (ii) a finite sample guarantee for Ployak-Ruppert averaging of SGD, and (iii) a new regret bound for rested Markovian bandits with general state space.
LGMay 26
From Scores to Gibbs Correctors: Accelerating Uniform-Rate Discrete Diffusion ModelsYuchen Liang, Ness Shroff, Yingbin Liang
Discrete diffusion models have achieved strong empirical performance in text and other symbolic domains, but, especially for uniform-rate models, they often require many steps to generate a single sample. Existing acceleration methods either rely on training additional quantities or suffer from slow mixing. In this work, we propose a novel Gibbs-based corrector for discrete diffusion models, termed Gibbs-Accelerated Discrete Diffusion (GADD). GADD leverages the structure of the concrete score function to construct Gibbs posterior likelihoods directly, without requiring any additional training beyond standard score estimation. We show that GADD achieves an overall sampling complexity of $\mathcal{O}(\mathrm{polylog} (\varepsilon^{-1}))$, yielding the first such rate for diffusion-based samplers for uniform-rate discrete diffusion models. We also conduct numerical experiments demonstrating the practical advantages of GADD across synthetic data, zero-shot text sampling, and zero-shot conditional music generation. These results corroborate the theory and show that GADD consistently improves sample quality and wall-clock efficiency over standard baselines, including vanilla Euler methods and CTMC correctors. Beyond this, our theoretical analysis introduces a novel framework for analyzing predictor-corrector methods in discrete diffusion models, which may be of independent interest. Unlike existing approaches that rely on the Girsanov change-of-measure technique, our method is based on an induction argument that tracks error propagation across predictor iterations while accounting for inaccuracies in the corrector updates.
LGJun 22, 2023
Achieving Sample and Computational Efficient Reinforcement Learning by Action Space Reduction via GroupingYining Li, Peizhong Ju, Ness Shroff
Reinforcement learning often needs to deal with the exponential growth of states and actions when exploring optimal control in high-dimensional spaces (often known as the curse of dimensionality). In this work, we address this issue by learning the inherent structure of action-wise similar MDP to appropriately balance the performance degradation versus sample/computational complexity. In particular, we partition the action spaces into multiple groups based on the similarity in transition distribution and reward function, and build a linear decomposition model to capture the difference between the intra-group transition kernel and the intra-group rewards. Both our theoretical analysis and experiments reveal a \emph{surprising and counter-intuitive result}: while a more refined grouping strategy can reduce the approximation error caused by treating actions in the same group as identical, it also leads to increased estimation error when the size of samples or the computation resources is limited. This finding highlights the grouping strategy as a new degree of freedom that can be optimized to minimize the overall performance loss. To address this issue, we formulate a general optimization problem for determining the optimal grouping strategy, which strikes a balance between performance loss and sample/computational complexity. We further propose a computationally efficient method for selecting a nearly-optimal grouping strategy, which maintains its computational complexity independent of the size of the action space.
NIOct 25, 2024Code
Artificial Intelligence of Things: A SurveyShakhrul Iman Siam, Hyunho Ahn, Li Liu et al.
The integration of the Internet of Things (IoT) and modern Artificial Intelligence (AI) has given rise to a new paradigm known as the Artificial Intelligence of Things (AIoT). In this survey, we provide a systematic and comprehensive review of AIoT research. We examine AIoT literature related to sensing, computing, and networking & communication, which form the three key components of AIoT. In addition to advancements in these areas, we review domain-specific AIoT systems that are designed for various important application domains. We have also created an accompanying GitHub repository, where we compile the papers included in this survey: https://github.com/AIoT-MLSys-Lab/AIoT-Survey. This repository will be actively maintained and updated with new research as it becomes available. As both IoT and AI become increasingly critical to our society, we believe AIoT is emerging as an essential research field at the intersection of IoT and modern AI. We hope this survey will serve as a valuable resource for those engaged in AIoT research and act as a catalyst for future explorations to bridge gaps and drive advancements in this exciting field.
LGSep 5, 2024
Can We Theoretically Quantify the Impacts of Local Updates on the Generalization Performance of Federated Learning?Peizhong Ju, Haibo Yang, Jia Liu et al.
Federated Learning (FL) has gained significant popularity due to its effectiveness in training machine learning models across diverse sites without requiring direct data sharing. While various algorithms along with their optimization analyses have shown that FL with local updates is a communication-efficient distributed learning framework, the generalization performance of FL with local updates has received comparatively less attention. This lack of investigation can be attributed to the complex interplay between data heterogeneity and infrequent communication due to the local updates within the FL framework. This motivates us to investigate a fundamental question in FL: Can we quantify the impact of data heterogeneity and local updates on the generalization performance for FL as the learning process evolves? To this end, we conduct a comprehensive theoretical study of FL's generalization performance using a linear model as the first step, where the data heterogeneity is considered for both the stationary and online/non-stationary cases. By providing closed-form expressions of the model error, we rigorously quantify the impact of the number of the local updates (denoted as $K$) under three settings ($K=1$, $K<\infty$, and $K=\infty$) and show how the generalization performance evolves with the number of rounds $t$. Our investigation also provides a comprehensive understanding of how different configurations (including the number of model parameters $p$ and the number of training samples $n$) contribute to the overall generalization performance, thus shedding new insights (such as benign overfitting) for implementing FL over networks.
LGFeb 26
Sharp Convergence Rates for Masked Diffusion ModelsYuchen Liang, Zhiheng Tan, Ness Shroff et al.
Discrete diffusion models have achieved strong empirical performance in text and other symbolic domains, with masked (absorbing-rate) variants emerging as competitive alternatives to autoregressive models. Among existing samplers, the Euler method remains the standard choice in many applications, and more recently, the First-Hitting Sampler (FHS) has shown considerable promise for masked diffusion models. Despite their practical success, the theoretical understanding of these samplers remains limited. Existing analyses are conducted in Kullback-Leibler (KL) divergence, which often yields loose parameter dependencies and requires strong assumptions on score estimation. Moreover, these guarantees do not cover recently developed high-performance sampler of FHS. In this work, we first develop a direct total-variation (TV) based analysis for the Euler method that overcomes these limitations. Our results relax assumptions on score estimation, improve parameter dependencies, and establish convergence guarantees without requiring any surrogate initialization. Also for this setting, we provide the first convergence lower bound for the Euler sampler, establishing tightness with respect to both the data dimension $d$ and the target accuracy $\varepsilon$. Finally, we analyze the FHS sampler and show that it incurs no sampling error beyond that induced by score estimation, which we show to be tight with a matching lower error bound. Overall, our analysis introduces a direct TV-based error decomposition along the CTMC trajectory and a decoupling-based path-wise analysis for FHS, which may be of independent interest.
LGFeb 25
Provable Last-Iterate Convergence for Multi-Objective Safe LLM Alignment via Optimistic Primal-DualYining Li, Peizhong Ju, Ness Shroff
Reinforcement Learning from Human Feedback (RLHF) plays a significant role in aligning Large Language Models (LLMs) with human preferences. While RLHF with expected reward constraints can be formulated as a primal-dual optimization problem, standard primal-dual methods only guarantee convergence with a distributional policy where the saddle-point problem is in convex-concave form. Moreover, standard primal-dual methods may exhibit instability or divergence in the last iterate under policy parameterization in practical applications. In this work, we propose a universal primal-dual framework for safe RLHF that unifies a broad class of existing alignment algorithms, including safe-RLHF, one-shot, and multi-shot based methods. Building on this framework, we introduce an optimistic primal-dual (OPD) algorithm that incorporates predictive updates for both primal and dual variables to stabilize saddle-point dynamics. We establish last-iterate convergence guarantees for the proposed method, covering both exact policy optimization in the distributional space and convergence to a neighborhood of the optimal solution whose gap is related to approximation error and bias under parameterized policies. Our analysis reveals that optimism plays a crucial role in mitigating oscillations inherent to constrained alignment objectives, thereby closing a key theoretical gap between constrained RL and practical RLHF.
LGFeb 21, 2024
Broadening Target Distributions for Accelerated Diffusion Models via a Novel Analysis ApproachYuchen Liang, Peizhong Ju, Yingbin Liang et al.
Accelerated diffusion models hold the potential to significantly enhance the efficiency of standard diffusion processes. Theoretically, these models have been shown to achieve faster convergence rates than the standard $\mathcal O(1/ε^2)$ rate of vanilla diffusion models, where $ε$ denotes the target accuracy. However, current theoretical studies have established the acceleration advantage only for restrictive target distribution classes, such as those with smoothness conditions imposed along the entire sampling path or with bounded support. In this work, we significantly broaden the target distribution classes with a new accelerated stochastic DDPM sampler. In particular, we show that it achieves accelerated performance for three broad distribution classes not considered before. Our first class relies on the smoothness condition posed only to the target density $q_0$, which is far more relaxed than the existing smoothness conditions posed to all $q_t$ along the entire sampling path. Our second class requires only a finite second moment condition, allowing for a much wider class of target distributions than the existing finite-support condition. Our third class is Gaussian mixture, for which our result establishes the first acceleration guarantee. Moreover, among accelerated DDPM type samplers, our results specialized for bounded-support distributions show an improved dependency on the data dimension $d$. Our analysis introduces a novel technique for establishing performance guarantees via constructing a tilting factor representation of the convergence error and utilizing Tweedie's formula to handle Taylor expansion terms. This new analytical framework may be of independent interest.
LGJun 2, 2025
Absorb and Converge: Provable Convergence Guarantee for Absorbing Discrete Diffusion ModelsYuchen Liang, Renxiang Huang, Lifeng Lai et al.
Discrete state space diffusion models have shown significant advantages in applications involving discrete data, such as text and image generation. It has also been observed that their performance is highly sensitive to the choice of rate matrices, particularly between uniform and absorbing rate matrices. While empirical results suggest that absorbing rate matrices often yield better generation quality compared to uniform rate matrices, existing theoretical works have largely focused on the uniform rate matrices case. Notably, convergence guarantees and error analyses for absorbing diffusion models are still missing. In this work, we provide the first finite-time error bounds and convergence rate analysis for discrete diffusion models using absorbing rate matrices. We begin by deriving an upper bound on the KL divergence of the forward process, introducing a surrogate initialization distribution to address the challenge posed by the absorbing stationary distribution, which is a singleton and causes the KL divergence to be ill-defined. We then establish the first convergence guarantees for both the $τ$-leaping and uniformization samplers under absorbing rate matrices, demonstrating improved rates over their counterparts using uniform rate matrices. Furthermore, under suitable assumptions, we provide convergence guarantees without early stopping. Our analysis introduces several new technical tools to address challenges unique to absorbing rate matrices. These include a Jensen-type argument for bounding forward process convergence, novel techniques for bounding absorbing score functions, and a non-divergent upper bound on the score near initialization that removes the need of early-stopping.
LGOct 17, 2024
Theory on Score-Mismatched Diffusion Models and Zero-Shot Conditional SamplersYuchen Liang, Peizhong Ju, Yingbin Liang et al.
The denoising diffusion model has recently emerged as a powerful generative technique, capable of transforming noise into meaningful data. While theoretical convergence guarantees for diffusion models are well established when the target distribution aligns with the training distribution, practical scenarios often present mismatches. One common case is in the zero-shot conditional diffusion sampling, where the target conditional distribution is different from the (unconditional) training distribution. These score-mismatched diffusion models remain largely unexplored from a theoretical perspective. In this paper, we present the first performance guarantee with explicit dimensional dependencies for general score-mismatched diffusion samplers, focusing on target distributions with finite second moments. We show that score mismatches result in an asymptotic distributional bias between the target and sampling distributions, proportional to the accumulated mismatch between the target and training distributions. This result can be directly applied to zero-shot conditional samplers for any conditional model, irrespective of measurement noise. Interestingly, the derived convergence upper bound offers useful guidance for designing a novel bias-optimal zero-shot sampler in linear conditional models that minimizes the asymptotic bias. For such bias-optimal samplers, we further establish convergence guarantees with explicit dependencies on dimension and conditioning, applied to several interesting target distributions, including those with bounded support and Gaussian mixtures. Our findings are supported by numerical studies.
IMOct 6, 2025
Large Language Models Achieve Gold Medal Performance at the International Olympiad on Astronomy & Astrophysics (IOAA)Lucas Carrit Delgado Pinheiro, Ziru Chen, Bruno Caixeta Piazza et al.
While task-specific demonstrations show early success in applying large language models (LLMs) to automate some astronomical research tasks, they only provide incomplete views of all necessary capabilities in solving astronomy problems, calling for more thorough understanding of LLMs' strengths and limitations. So far, existing benchmarks and evaluations focus on simple question-answering that primarily tests astronomical knowledge and fails to evaluate the complex reasoning required for real-world research in the discipline. Here, we address this gap by systematically benchmarking five state-of-the-art LLMs on the International Olympiad on Astronomy and Astrophysics (IOAA) exams, which are designed to examine deep conceptual understanding, multi-step derivations, and multimodal analysis. With average scores of 85.6% and 84.2%, Gemini 2.5 Pro and GPT-5 (the two top-performing models) not only achieve gold medal level performance but also rank in the top two among ~200-300 participants in all four IOAA theory exams evaluated (2022-2025). In comparison, results on the data analysis exams show more divergence. GPT-5 still excels in the exams with an 88.5% average score, ranking top 10 among the participants in the four most recent IOAAs, while other models' performances drop to 48-76%. Furthermore, our in-depth error analysis underscores conceptual reasoning, geometric reasoning, and spatial visualization (52-79% accuracy) as consistent weaknesses among all LLMs. Hence, although LLMs approach peak human performance in theory exams, critical gaps must be addressed before they can serve as autonomous research agents in astronomy.
CYJun 21, 2025
AI Safety vs. AI Security: Demystifying the Distinction and BoundariesZhiqiang Lin, Huan Sun, Ness Shroff
Artificial Intelligence (AI) is rapidly being integrated into critical systems across various domains, from healthcare to autonomous vehicles. While its integration brings immense benefits, it also introduces significant risks, including those arising from AI misuse. Within the discourse on managing these risks, the terms "AI Safety" and "AI Security" are often used, sometimes interchangeably, resulting in conceptual confusion. This paper aims to demystify the distinction and delineate the precise research boundaries between AI Safety and AI Security. We provide rigorous definitions, outline their respective research focuses, and explore their interdependency, including how security breaches can precipitate safety failures and vice versa. Using clear analogies from message transmission and building construction, we illustrate these distinctions. Clarifying these boundaries is crucial for guiding precise research directions, fostering effective cross-disciplinary collaboration, enhancing policy effectiveness, and ultimately, promoting the deployment of trustworthy AI systems.
LGSep 20, 2025
Discrete Diffusion Models: Novel Analysis and New Sampler GuaranteesYuchen Liang, Yingbin Liang, Lifeng Lai et al.
Discrete diffusion models have recently gained significant prominence in applications involving natural language and graph data. A key factor influencing their effectiveness is the efficiency of discretized samplers. Among these, $τ$-leaping samplers have become particularly popular due to their theoretical and empirical success. However, existing theoretical analyses of $τ$-leaping often rely on somewhat restrictive and difficult-to-verify regularity assumptions, and their convergence bounds contain quadratic dependence on the vocabulary size. In this work, we introduce a new analytical approach for discrete diffusion models that removes the need for such assumptions. For the standard $τ$-leaping method, we establish convergence guarantees in KL divergence that scale linearly with vocabulary size, improving upon prior results with quadratic dependence. Our approach is also more broadly applicable: it provides the first convergence guarantees for other widely used samplers, including the Euler method and Tweedie $τ$-leaping. Central to our approach is a novel technique based on differential inequalities, offering a more flexible alternative to the traditional Girsanov change-of-measure methods. This technique may also be of independent interest for the analysis of other stochastic processes.
LGMay 30, 2025
Unlocking the Power of Rehearsal in Continual Learning: A Theoretical PerspectiveJunze Deng, Qinhang Wu, Peizhong Ju et al.
Rehearsal-based methods have shown superior performance in addressing catastrophic forgetting in continual learning (CL) by storing and training on a subset of past data alongside new data in current task. While such a concurrent rehearsal strategy is widely used, it remains unclear if this approach is always optimal. Inspired by human learning, where sequentially revisiting tasks helps mitigate forgetting, we explore whether sequential rehearsal can offer greater benefits for CL compared to standard concurrent rehearsal. To address this question, we conduct a theoretical analysis of rehearsal-based CL in overparameterized linear models, comparing two strategies: 1) Concurrent Rehearsal, where past and new data are trained together, and 2) Sequential Rehearsal, where new data is trained first, followed by revisiting past data sequentially. By explicitly characterizing forgetting and generalization error, we show that sequential rehearsal performs better when tasks are less similar. These insights further motivate a novel Hybrid Rehearsal method, which trains similar tasks concurrently and revisits dissimilar tasks sequentially. We characterize its forgetting and generalization performance, and our experiments with deep neural networks further confirm that the hybrid approach outperforms standard concurrent rehearsal. This work provides the first comprehensive theoretical analysis of rehearsal-based CL.
LGAug 26, 2021
Adaptive Control of Differentially Private Linear Quadratic SystemsSayak Ray Chowdhury, Xingyu Zhou, Ness Shroff
In this paper, we study the problem of regret minimization in reinforcement learning (RL) under differential privacy constraints. This work is motivated by the wide range of RL applications for providing personalized service, where privacy concerns are becoming paramount. In contrast to previous works, we take the first step towards non-tabular RL settings, while providing a rigorous privacy guarantee. In particular, we consider the adaptive control of differentially private linear quadratic (LQ) systems. We develop the first private RL algorithm, PRL, which is able to attain a sub-linear regret while guaranteeing privacy protection. More importantly, the additional cost due to privacy is only on the order of $\frac{\ln(1/δ)^{1/4}}{ε^{1/2}}$ given privacy parameters $ε, δ> 0$. Through this process, we also provide a general procedure for adaptive control of LQ systems under changing regularizers, which not only generalizes previous non-private controls, but also serves as the basis for general private controls.
LGJul 6, 2021
Weighted Gaussian Process Bandits for Non-stationary EnvironmentsYuntian Deng, Xingyu Zhou, Baekjin Kim et al.
In this paper, we consider the Gaussian process (GP) bandit optimization problem in a non-stationary environment. To capture external changes, the black-box function is allowed to be time-varying within a reproducing kernel Hilbert space (RKHS). To this end, we develop WGP-UCB, a novel UCB-type algorithm based on weighted Gaussian process regression. A key challenge is how to cope with infinite-dimensional feature maps. To that end, we leverage kernel approximation techniques to prove a sublinear regret bound, which is the first (frequentist) sublinear regret guarantee on weighted time-varying bandits with general nonlinear rewards. This result generalizes both non-stationary linear bandits and standard GP-UCB algorithms. Further, a novel concentration inequality is achieved for weighted Gaussian process regression with general weights. We also provide universal upper bounds and weight-dependent upper bounds for weighted maximum information gains. These results are of independent interest for applications such as news ranking and adaptive pricing, where weights can be adopted to capture the importance or quality of data. Finally, we conduct experiments to highlight the favorable gains of the proposed algorithm in many cases when compared to existing methods.
LGFeb 11, 2021
No-Regret Algorithms for Time-Varying Bayesian OptimizationXingyu Zhou, Ness Shroff
In this paper, we consider the time-varying Bayesian optimization problem. The unknown function at each time is assumed to lie in an RKHS (reproducing kernel Hilbert space) with a bounded norm. We adopt the general variation budget model to capture the time-varying environment, and the variation is characterized by the change of the RKHS norm. We adapt the restart and sliding window mechanism to introduce two GP-UCB type algorithms: R-GP-UCB and SW-GP-UCB, respectively. We derive the first (frequentist) regret guarantee on the dynamic regret for both algorithms. Our results not only recover previous linear bandit results when a linear kernel is used, but complement the previous regret analysis of time-varying Gaussian process bandit under a Bayesian-type regularity assumption, i.e., each function is a sample from a Gaussian process.
LGOct 28, 2020
Bandit Policies for Reliable Cellular Network Handovers in Extreme MobilityYuanjie Li, Esha Datta, Jiaxin Ding et al.
The demand for seamless Internet access under extreme user mobility, such as on high-speed trains and vehicles, has become a norm rather than an exception. However, the 4G/5G mobile network is not always reliable to meet this demand, with non-negligible failures during the handover between base stations. A fundamental challenge of reliability is to balance the exploration of more measurements for satisfactory handover, and exploitation for timely handover (before the fast-moving user leaves the serving base station's radio coverage). This paper formulates this trade-off in extreme mobility as a composition of two distinct multi-armed bandit problems. We propose Bandit and Threshold Tuning (BATT) to minimize the regret of handover failures in extreme mobility. BATT uses $ε$-binary-search to optimize the threshold of the serving cell's signal strength to initiate the handover procedure with $\mathcal{O}(\log J \log T)$ regret.It further devises opportunistic Thompson sampling, which optimizes the sequence of the target cells to measure for reliable handover with $\mathcal{O}(\log T)$ regret.Our experiment over a real LTE dataset from Chinese high-speed rails validates significant regret reduction and a 29.1% handover failure reduction.
LGOct 13, 2020
Multi-Armed Bandits with Dependent ArmsRahul Singh, Fang Liu, Yin Sun et al.
We study a variant of the classical multi-armed bandit problem (MABP) which we call as Multi-Armed Bandits with dependent arms. More specifically, multiple arms are grouped together to form a cluster, and the reward distributions of arms belonging to the same cluster are known functions of an unknown parameter that is a characteristic of the cluster. Thus, pulling an arm $i$ not only reveals information about its own reward distribution, but also about all those arms that share the same cluster with arm $i$. This "correlation" amongst the arms complicates the exploration-exploitation trade-off that is encountered in the MABP because the observation dependencies allow us to test simultaneously multiple hypotheses regarding the optimality of an arm. We develop learning algorithms based on the UCB principle which utilize these additional side observations appropriately while performing exploration-exploitation trade-off. We show that the regret of our algorithms grows as $O(K\log T)$, where $K$ is the number of clusters. In contrast, for an algorithm such as the vanilla UCB that is optimal for the classical MABP and does not utilize these dependencies, the regret scales as $O(M\log T)$ where $M$ is the number of arms.
CRJun 23, 2020
ACOUSTIC-TURF: Acoustic-based Privacy-Preserving COVID-19 Contact TracingYuxiang Luo, Cheng Zhang, Yunqi Zhang et al.
In this paper, we propose a new privacy-preserving, automated contact tracing system, ACOUSTIC-TURF, to fight COVID-19 using acoustic signals sent from ubiquitous mobile devices. At a high level, ACOUSTIC-TURF adaptively broadcasts inaudible ultrasonic signals with randomly generated IDs in the vicinity. Simultaneously, the system receives other ultrasonic signals sent from nearby (e.g., 6 feet) users. In such a system, individual user IDs are not disclosed to others and the system can accurately detect encounters in physical proximity with 6-foot granularity. We have implemented a prototype of ACOUSTIC-TURF on Android and evaluated its performance in terms of acoustic-signal-based encounter detection accuracy and power consumption at different ranges and under various occlusion scenarios. Experimental results show that ACOUSTIC-TURF can detect multiple contacts within a 6-foot range for mobile phones placed in pockets and outside pockets. Furthermore, our acoustic-signal-based system achieves greater precision than wireless-signal-based approaches when contact tracing is performed through walls. ACOUSTIC-TURF correctly determines that people on opposite sides of a wall are not in contact with one another, whereas the Bluetooth-based approaches detect nonexistent contacts among them.
LGJun 6, 2020
Contextual Bandits with Side-ObservationsRahul Singh, Fang Liu, Xin Liu et al.
We investigate contextual bandits in the presence of side-observations across arms in order to design recommendation algorithms for users connected via social networks. Users in social networks respond to their friends' activity, and hence provide information about each other's preferences. In our model, when a learning algorithm recommends an article to a user, not only does it observe his/her response (e.g. an ad click), but also the side-observations, i.e., the response of his neighbors if they were presented with the same article. We model these observation dependencies by a graph $\mathcal{G}$ in which nodes correspond to users, and edges correspond to social links. We derive a problem/instance-dependent lower-bound on the regret of any consistent algorithm. We propose an optimization (linear programming) based data-driven learning algorithm that utilizes the structure of $\mathcal{G}$ in order to make recommendations to users and show that it is asymptotically optimal, in the sense that its regret matches the lower-bound as the number of rounds $T\to\infty$. We show that this asymptotically optimal regret is upper-bounded as $O\left(|χ(\mathcal{G})|\log T\right)$, where $|χ(\mathcal{G})|$ is the domination number of $\mathcal{G}$. In contrast, a naive application of the existing learning algorithms results in $O\left(N\log T\right)$ regret, where $N$ is the number of users.
LGMay 16, 2019
Data Poisoning Attacks on Stochastic BanditsFang Liu, Ness Shroff
Stochastic multi-armed bandits form a class of online learning problems that have important applications in online recommendation systems, adaptive medical treatment, and many others. Even though potential attacks against these learning algorithms may hijack their behavior, causing catastrophic loss in real-world applications, little is known about adversarial attacks on bandit algorithms. In this paper, we propose a framework of offline attacks on bandit algorithms and study convex optimization based attacks on several popular bandit algorithms. We show that the attacker can force the bandit algorithm to pull a target arm with high probability by a slight manipulation of the rewards in the data. Then we study a form of online attacks on bandit algorithms and propose an adaptive attack strategy against any bandit algorithm without the knowledge of the bandit algorithm. Our adaptive attack strategy can hijack the behavior of the bandit algorithm to suffer a linear regret with only a logarithmic cost to the attacker. Our results demonstrate a significant security threat to stochastic bandits.
LGOct 28, 2018
Exploring $k$ out of Top $ρ$ Fraction of Arms in Stochastic BanditsWenbo Ren, Jia Liu, Ness Shroff
This paper studies the problem of identifying any $k$ distinct arms among the top $ρ$ fraction (e.g., top 5\%) of arms from a finite or infinite set with a probably approximately correct (PAC) tolerance $ε$. We consider two cases: (i) when the threshold of the top arms' expected rewards is known and (ii) when it is unknown. We prove lower bounds for the four variants (finite or infinite arms, and known or unknown threshold), and propose algorithms for each. Two of these algorithms are shown to be sample complexity optimal (up to constant factors) and the other two are optimal up to a log factor. Results in this paper provide up to $ρn/k$ reductions compared with the "$k$-exploration" algorithms that focus on finding the (PAC) best $k$ arms out of $n$ arms. We also numerically show improvements over the state-of-the-art.
MLMay 23, 2018
Analysis of Thompson Sampling for Graphical Bandits Without the GraphsFang Liu, Zizhan Zheng, Ness Shroff
We study multi-armed bandit problems with graph feedback, in which the decision maker is allowed to observe the neighboring actions of the chosen action, in a setting where the graph may vary over time and is never fully revealed to the decision maker. We show that when the feedback graphs are undirected, the original Thompson Sampling achieves the optimal (within logarithmic factors) regret $\tilde{O}\left(\sqrt{β_0(G)T}\right)$ over time horizon $T$, where $β_0(G)$ is the average independence number of the latent graphs. To the best of our knowledge, this is the first result showing that the original Thompson Sampling is optimal for graphical bandits in the undirected setting. A slightly weaker regret bound of Thompson Sampling in the directed setting is also presented. To fill this gap, we propose a variant of Thompson Sampling, that attains the optimal regret in the directed setting within a logarithmic factor. Both algorithms can be implemented efficiently and do not require the knowledge of the feedback graphs at any time.
LGApr 16, 2018
UCBoost: A Boosting Approach to Tame Complexity and Optimality for Stochastic BanditsFang Liu, Sinong Wang, Swapna Buccapatnam et al.
In this work, we address the open problem of finding low-complexity near-optimal multi-armed bandit algorithms for sequential decision making problems. Existing bandit algorithms are either sub-optimal and computationally simple (e.g., UCB1) or optimal and computationally complex (e.g., kl-UCB). We propose a boosting approach to Upper Confidence Bound based algorithms for stochastic bandits, that we call UCBoost. Specifically, we propose two types of UCBoost algorithms. We show that UCBoost($D$) enjoys $O(1)$ complexity for each arm per round as well as regret guarantee that is $1/e$-close to that of the kl-UCB algorithm. We propose an approximation-based UCBoost algorithm, UCBoost($ε$), that enjoys a regret guarantee $ε$-close to that of kl-UCB as well as $O(\log(1/ε))$ complexity for each arm per round. Hence, our algorithms provide practitioners a practical way to trade optimality with computational complexity. Finally, we present numerical results which show that UCBoost($ε$) can achieve the same regret performance as the standard kl-UCB while incurring only $1\%$ of the computational cost of kl-UCB.
LGNov 8, 2017
A Change-Detection based Framework for Piecewise-stationary Multi-Armed Bandit ProblemFang Liu, Joohyun Lee, Ness Shroff
The multi-armed bandit problem has been extensively studied under the stationary assumption. However in reality, this assumption often does not hold because the distributions of rewards themselves may change over time. In this paper, we propose a change-detection (CD) based framework for multi-armed bandit problems under the piecewise-stationary setting, and study a class of change-detection based UCB (Upper Confidence Bound) policies, CD-UCB, that actively detects change points and restarts the UCB indices. We then develop CUSUM-UCB and PHT-UCB, that belong to the CD-UCB class and use cumulative sum (CUSUM) and Page-Hinkley Test (PHT) to detect changes. We show that CUSUM-UCB obtains the best known regret upper bound under mild assumptions. We also demonstrate the regret reduction of the CD-UCB policies over arbitrary Bernoulli rewards and Yahoo! datasets of webpage click-through rates.
LGNov 8, 2017
Information Directed Sampling for Stochastic Bandits with Graph FeedbackFang Liu, Swapna Buccapatnam, Ness Shroff
We consider stochastic multi-armed bandit problems with graph feedback, where the decision maker is allowed to observe the neighboring actions of the chosen action. We allow the graph structure to vary with time and consider both deterministic and Erdős-Rényi random graph models. For such a graph feedback model, we first present a novel analysis of Thompson sampling that leads to tighter performance bound than existing work. Next, we propose new Information Directed Sampling based policies that are graph-aware in their decision making. Under the deterministic graph case, we establish a Bayesian regret bound for the proposed policies that scales with the clique cover number of the graph instead of the number of actions. Under the random graph case, we provide a Bayesian regret bound for the proposed policies that scales with the ratio of the number of actions over the expected number of observations per iteration. To the best of our knowledge, this is the first analytical result for stochastic bandits with random graph feedback. Finally, using numerical evaluations, we demonstrate that our proposed IDS policies outperform existing approaches, including adaptions of upper confidence bound, $ε$-greedy and Exp3 algorithms.
GTJan 30, 2017
Security Game with Non-additive Utilities and Multiple Attacker ResourcesSinong Wang, Ness Shroff
There has been significant interest in studying security games for modeling the interplay of attacks and defenses on various systems involving critical infrastructure, financial system security, political campaigns, and civil safeguarding. However, existing security game models typically either assume additive utility functions, or that the attacker can attack only one target. Such assumptions lead to tractable analysis, but miss key inherent dependencies that exist among different targets in current complex networks. In this paper, we generalize the classical security game models to allow for non-additive utility functions. We also allow attackers to be able to attack multiple targets. We examine such a general security game from a theoretical perspective and provide a unified view. In particular, we show that each security game is equivalent to a combinatorial optimization problem over a set system $\varepsilon$, which consists of defender's pure strategy space. The key technique we use is based on the transformation, projection of a polytope, and the elipsoid method. This work settles several open questions in security game domain and significantly extends the state of-the-art of both the polynomial solvable and NP-hard class of the security game.
GTMar 2, 2016
Non-additive Security GamesSinong Wang, Fang Liu, Ness Shroff
We have investigated the security game under non-additive utility functions.