LGJul 25, 2022
Provably Efficient Fictitious Play Policy Optimization for Zero-Sum Markov Games with Structured TransitionsShuang Qiu, Xiaohan Wei, Jieping Ye et al.
While single-agent policy optimization in a fixed environment has attracted a lot of research attention recently in the reinforcement learning community, much less is known theoretically when there are multiple agents playing in a potentially competitive environment. We take steps forward by proposing and analyzing new fictitious play policy optimization algorithms for zero-sum Markov games with structured but unknown transitions. We consider two classes of transition structures: factored independent transition and single-controller transition. For both scenarios, we prove tight $\widetilde{\mathcal{O}}(\sqrt{K})$ regret bounds after $K$ episodes in a two-agent competitive game scenario. The regret of each agent is measured against a potentially adversarial opponent who can choose a single best policy in hindsight after observing the full policy sequence. Our algorithms feature a combination of Upper Confidence Bound (UCB)-type optimism and fictitious play under the scope of simultaneous policy optimization in a non-stationary environment. When both players adopt the proposed algorithms, their overall optimality gap is $\widetilde{\mathcal{O}}(\sqrt{K})$.
IRMar 11, 2022
DHEN: A Deep and Hierarchical Ensemble Network for Large-Scale Click-Through Rate PredictionBuyun Zhang, Liang Luo, Xi Liu et al.
Learning feature interactions is important to the model performance of online advertising services. As a result, extensive efforts have been devoted to designing effective architectures to learn feature interactions. However, we observe that the practical performance of those designs can vary from dataset to dataset, even when the order of interactions claimed to be captured is the same. That indicates different designs may have different advantages and the interactions captured by them have non-overlapping information. Motivated by this observation, we propose DHEN - a deep and hierarchical ensemble architecture that can leverage strengths of heterogeneous interaction modules and learn a hierarchy of the interactions under different orders. To overcome the challenge brought by DHEN's deeper and multi-layer structure in training, we propose a novel co-designed training system that can further improve the training efficiency of DHEN. Experiments of DHEN on large-scale dataset from CTR prediction tasks attained 0.27\% improvement on the Normalized Entropy (NE) of prediction and 1.2x better training throughput than state-of-the-art baseline, demonstrating their effectiveness in practice.
92.8CLApr 24
Learning Evidence Highlighting for Frozen LLMsShaoang Li, Yanhang Shi, Yufei Li et al.
Large Language Models (LLMs) can reason well, yet often miss decisive evidence when it is buried in long, noisy contexts. We introduce HiLight, an Evidence Emphasis framework that decouples evidence selection from reasoning for frozen LLM solvers. HiLight avoids compressing or rewriting the input, which can discard or distort evidence, by training a lightweight Emphasis Actor to insert minimal highlight tags around pivotal spans in the unaltered context. A frozen Solver then performs downstream reasoning on the emphasized input. We cast highlighting as a weakly supervised decision-making problem and optimize the Actor with reinforcement learning using only the Solver's task reward, requiring no evidence labels and no access to or modification of the Solver. Across sequential recommendation and long-context question answering, HiLight consistently improves performance over strong prompt-based and automated prompt-optimization baselines. The learned emphasis policy transfers zero-shot to both smaller and larger unseen Solver families, including an API-based Solver, suggesting that the Actor captures genuine, reusable evidence structure rather than overfitting to a single backbone.
41.3LGApr 13
SOLARIS: Speculative Offloading of Latent-bAsed Representation for Inference ScalingZikun Liu, Liang Luo, Qianru Li et al.
Recent advances in recommendation scaling laws have led to foundation models of unprecedented complexity. While these models offer superior performance, their computational demands make real-time serving impractical, often forcing practitioners to rely on knowledge distillation-compromising serving quality for efficiency. To address this challenge, we present SOLARIS (Speculative Offloading of Latent-bAsed Representation for Inference Scaling), a novel framework inspired by speculative decoding. SOLARIS proactively precomputes user-item interaction embeddings by predicting which user-item pairs are likely to appear in future requests, and asynchronously generating their foundation model representations ahead of time. This approach decouples the costly foundation model inference from the latency-critical serving path, enabling real-time knowledge transfer from models previously considered too expensive for online use. Deployed across Meta's advertising system serving billions of daily requests, SOLARIS achieves 0.67% revenue-driving top-line metrics gain, demonstrating its effectiveness at scale.
87.3LGMay 11
LoKA: Low-precision Kernel Applications for Recommendation Models At ScaleLiang Luo, Yinbin Ma, Quanyu Zhu et al.
Recent GPU generations deliver significantly higher FLOPs using lower-precision arithmetic, such as FP8. While successfully applied to large language models (LLMs), its adoption in large recommendation models (LRMs) has been limited. This is because LRMs are numerically sensitive, dominated by small matrix multiplications (GEMMs) followed by normalization, and trained in communication-intensive environments. Applying FP8 directly to LRMs often degrades model quality and prolongs training time. These challenges are inherent to LRM workloads and cannot be resolved merely by introducing better FP8 kernels. Instead, a system-model co-design approach is needed to successfully integrate FP8. We present LoKA (Low-precision Kernel Applications), a framework that makes FP8 practical for LRMs through three principles: profile under realistic distributions to know where low precision is safe, co-design model components with hardware to expand where it is safe, and orchestrate across kernel libraries to maximize the gains. Concretely, LoKA Probe is a statistically grounded, online benchmarking method that learns activation and weight statistics, and quantifies per-layer errors. This process pinpoints safe and unsafe, fast and slow sites for FP8 adoption. LoKA Mods is a set of reusable model adaptations that improve both numerical stability and execution efficiency with FP8. LoKA Dispatch is a runtime that leverages the statistical insights from LoKA Probe to select the fastest FP8 kernel that satisfies the accuracy requirements.
36.4ROMar 20
Evolving Embodied Intelligence: Graph Neural Network--Driven Co-Design of Morphology and Control in Soft RoboticsJianqiang Wang, Shuaiqun Pan, Alvaro Serra-Gomez et al.
The intelligent behavior of robots does not emerge solely from control systems, but from the tight coupling between body and brain, a principle known as embodied intelligence. Designing soft robots that leverage this interaction remains a significant challenge, particularly when morphology and control require simultaneous optimization. A significant obstacle in this co-design process is that morphological evolution can disrupt learned control strategies, making it difficult to reuse or adapt existing knowledge. We address this by develop a Graph Neural Network-based approach for the co-design of morphology and controller. Each robot is represented as a graph, with a graph attention network (GAT) encoding node features and a pooled representation passed through a multilayer perceptron (MLP) head to produce actuator commands or value estimates. During evolution, inheritance follows a topology-consistent mapping: shared GAT layers are reused, MLP hidden layers are transferred intact, matched actuator outputs are copied, and unmatched ones are randomly initialized and fine-tuned. This morphology-aware policy class lets the controller adapt when the body mutates. On the benchmark, our GAT-based approach achieves higher final fitness and stronger adaptability to morphological variations compared to traditional MLP-only co-design methods. These results indicate that graph-structured policies provide a more effective interface between evolving morphologies and control for embodied intelligence.
IRApr 2, 2025
Enhancing Embedding Representation Stability in Recommendation Systems with Semantic IDCarolina Zheng, Minhui Huang, Dmitrii Pedchenko et al.
The exponential growth of online content has posed significant challenges to ID-based models in industrial recommendation systems, ranging from extremely high cardinality and dynamically growing ID space, to highly skewed engagement distributions, to prediction instability as a result of natural id life cycles (e.g, the birth of new IDs and retirement of old IDs). To address these issues, many systems rely on random hashing to handle the id space and control the corresponding model parameters (i.e embedding table). However, this approach introduces data pollution from multiple ids sharing the same embedding, leading to degraded model performance and embedding representation instability. This paper examines these challenges and introduces Semantic ID prefix ngram, a novel token parameterization technique that significantly improves the performance of the original Semantic ID. Semantic ID prefix ngram creates semantically meaningful collisions by hierarchically clustering items based on their content embeddings, as opposed to random assignments. Through extensive experimentation, we demonstrate that Semantic ID prefix ngram not only addresses embedding instability but also significantly improves tail id modeling, reduces overfitting, and mitigates representation shifts. We further highlight the advantages of Semantic ID prefix ngram in attention-based models that contextualize user histories, showing substantial performance improvements. We also report our experience of integrating Semantic ID into Meta production Ads Ranking system, leading to notable performance gains and enhanced prediction stability in live deployments.
IRFeb 20, 2025
External Large Foundation Model: How to Efficiently Serve Trillions of Parameters for Online Ads RecommendationMingfu Liang, Xi Liu, Rong Jin et al.
Ads recommendation is a prominent service of online advertising systems and has been actively studied. Recent studies indicate that scaling-up and advanced design of the recommendation model can bring significant performance improvement. However, with a larger model scale, such prior studies have a significantly increasing gap from industry as they often neglect two fundamental challenges in industrial-scale applications. First, training and inference budgets are restricted for the model to be served, exceeding which may incur latency and impair user experience. Second, large-volume data arrive in a streaming mode with data distributions dynamically shifting, as new users/ads join and existing users/ads leave the system. We propose the External Large Foundation Model (ExFM) framework to address the overlooked challenges. Specifically, we develop external distillation and a data augmentation system (DAS) to control the computational cost of training/inference while maintaining high performance. We design the teacher in a way like a foundation model (FM) that can serve multiple students as vertical models (VMs) to amortize its building cost. We propose Auxiliary Head and Student Adapter to mitigate the data distribution gap between FM and VMs caused by the streaming data issue. Comprehensive experiments on internal industrial-scale applications and public datasets demonstrate significant performance gain by ExFM.
IRJan 9, 2024
Fine-Grained Embedding Dimension Optimization During Training for Recommender SystemsQinyi Luo, Penghan Wang, Wei Zhang et al.
Huge embedding tables in modern deep learning recommender models (DLRM) require prohibitively large memory during training and inference. This paper proposes FIITED, a system to automatically reduce the memory footprint via FIne-grained In-Training Embedding Dimension pruning. By leveraging the key insight that embedding vectors are not equally important, FIITED adaptively adjusts the dimension of each individual embedding vector during model training, assigning larger dimensions to more important embeddings while adapting to dynamic changes in data. We prioritize embedding dimensions with higher frequencies and gradients as more important. To enable efficient pruning of embeddings and their dimensions during model training, we propose an embedding storage system based on virtually-hashed physically-indexed hash tables. Experiments on two industry models and months of realistic datasets show that FIITED can reduce DLRM embedding size by more than 65% while preserving model quality, outperforming state-of-the-art in-training embedding pruning methods. On public datasets, FIITED can reduce the size of embedding tables by 2.1x to 800x with negligible accuracy drop, while improving model throughput.
LGMar 8
Feed m Birds with One Scone: Accelerating Multi-task Gradient Balancing via Bi-level OptimizationXuxing Chen, Yun He, Jiayi Xu et al.
In machine learning, the goal of multi-task learning (MTL) is to optimize multiple objectives together. Recent works, for example, Multiple Gradient Descent Algorithm (MGDA) and its variants, show promising results with dynamically adjusted weights for different tasks to mitigate conflicts that may potentially degrade the performance on certain tasks. Despite the empirical success of MGDA-type methods, one major limitation of such methods is their computational inefficiency, as they require access to all task gradients. In this paper we introduce MARIGOLD, a unified algorithmic framework for efficiently solving MTL problems. Our method reveals that multi-task gradient balancing methods have a hierarchical structure, in which the model training and the gradient balancing are coupled during the whole optimization process and can be viewed as a bi-level optimization problem. Moreover, we showcase that the bi-level problem can be solved efficiently by leveraging zeroth-order method. Extensive experiments on both public datasets and industrial-scale datasets demonstrate the efficiency and superiority of our method.
LGMay 31, 2023
Provably Efficient Generalized Lagrangian Policy Optimization for Safe Multi-Agent Reinforcement LearningDongsheng Ding, Xiaohan Wei, Zhuoran Yang et al.
We examine online safe multi-agent reinforcement learning using constrained Markov games in which agents compete by maximizing their expected total rewards under a constraint on expected total utilities. Our focus is confined to an episodic two-player zero-sum constrained Markov game with independent transition functions that are unknown to agents, adversarial reward functions, and stochastic utility functions. For such a Markov game, we employ an approach based on the occupancy measure to formulate it as an online constrained saddle-point problem with an explicit constraint. We extend the Lagrange multiplier method in constrained optimization to handle the constraint by creating a generalized Lagrangian with minimax decision primal variables and a dual variable. Next, we develop an upper confidence reinforcement learning algorithm to solve this Lagrangian problem while balancing exploration and exploitation. Our algorithm updates the minimax decision primal variables via online mirror descent and the dual variable via projected gradient step and we prove that it enjoys sublinear rate $ O((|X|+|Y|) L \sqrt{T(|A|+|B|)}))$ for both regret and constraint violation after playing $T$ episodes of the game. Here, $L$ is the horizon of each episode, $(|X|,|A|)$ and $(|Y|,|B|)$ are the state/action space sizes of the min-player and the max-player, respectively. To the best of our knowledge, we provide the first provably efficient online safe reinforcement learning algorithm in constrained Markov games.
LGOct 10, 2021
Frequency-aware SGD for Efficient Embedding Learning with Provable BenefitsYan Li, Dhruv Choudhary, Xiaohan Wei et al.
Embedding learning has found widespread applications in recommendation systems and natural language modeling, among other domains. To learn quality embeddings efficiently, adaptive learning rate algorithms have demonstrated superior empirical performance over SGD, largely accredited to their token-dependent learning rate. However, the underlying mechanism for the efficiency of token-dependent learning rate remains underexplored. We show that incorporating frequency information of tokens in the embedding learning problems leads to provably efficient algorithms, and demonstrate that common adaptive algorithms implicitly exploit the frequency information to a large extent. Specifically, we propose (Counter-based) Frequency-aware Stochastic Gradient Descent, which applies a frequency-dependent learning rate for each token, and exhibits provable speed-up compared to SGD when the token distribution is imbalanced. Empirically, we show the proposed algorithms are able to improve or match adaptive algorithms on benchmark recommendation tasks and a large-scale industrial recommendation system, closing the performance gap between SGD and adaptive algorithms. Our results are the first to show token-dependent learning rate provably improves convergence for non-convex embedding learning problems.
LGMay 26, 2021
Low-Precision Hardware Architectures Meet Recommendation Model Inference at ScaleZhaoxia, Deng, Jongsoo Park et al.
Tremendous success of machine learning (ML) and the unabated growth in ML model complexity motivated many ML-specific designs in both CPU and accelerator architectures to speed up the model inference. While these architectures are diverse, highly optimized low-precision arithmetic is a component shared by most. Impressive compute throughputs are indeed often exhibited by these architectures on benchmark ML models. Nevertheless, production models such as recommendation systems important to Facebook's personalization services are demanding and complex: These systems must serve billions of users per month responsively with low latency while maintaining high prediction accuracy, notwithstanding computations with many tens of billions parameters per inference. Do these low-precision architectures work well with our production recommendation systems? They do. But not without significant effort. We share in this paper our search strategies to adapt reference recommendation models to low-precision hardware, our optimization of low-precision compute kernels, and the design and development of tool chain so as to maintain our models' accuracy throughout their lifespan during which topic trends and users' interests inevitably evolve. Practicing these low-precision technologies helped us save datacenter capacities while deploying models with up to 5X complexity that would otherwise not be deployed on traditional general-purpose CPUs. We believe these lessons from the trenches promote better co-design between hardware architecture and software engineering and advance the state of the art of ML in industry.
LGOct 18, 2020
Training Recommender Systems at Scale: Communication-Efficient Model and Data ParallelismVipul Gupta, Dhruv Choudhary, Ping Tak Peter Tang et al.
In this paper, we consider hybrid parallelism -- a paradigm that employs both Data Parallelism (DP) and Model Parallelism (MP) -- to scale distributed training of large recommendation models. We propose a compression framework called Dynamic Communication Thresholding (DCT) for communication-efficient hybrid training. DCT filters the entities to be communicated across the network through a simple hard-thresholding function, allowing only the most relevant information to pass through. For communication efficient DP, DCT compresses the parameter gradients sent to the parameter server during model synchronization. The threshold is updated only once every few thousand iterations to reduce the computational overhead of compression. For communication efficient MP, DCT incorporates a novel technique to compress the activations and gradients sent across the network during the forward and backward propagation, respectively. This is done by identifying and updating only the most relevant neurons of the neural network for each training sample in the data. We evaluate DCT on publicly available natural language processing and recommender models and datasets, as well as recommendation systems used in production at Facebook. DCT reduces communication by at least $100\times$ and $20\times$ during DP and MP, respectively. The algorithm has been deployed in production, and it improves end-to-end training time for a state-of-the-art industrial recommender model by 37\%, without any loss in performance.
LGAug 23, 2020
Single-Timescale Stochastic Nonconvex-Concave Optimization for Smooth Nonlinear TD LearningShuang Qiu, Zhuoran Yang, Xiaohan Wei et al.
Temporal-Difference (TD) learning with nonlinear smooth function approximation for policy evaluation has achieved great success in modern reinforcement learning. It is shown that such a problem can be reformulated as a stochastic nonconvex-strongly-concave optimization problem, which is challenging as naive stochastic gradient descent-ascent algorithm suffers from slow convergence. Existing approaches for this problem are based on two-timescale or double-loop stochastic gradient algorithms, which may also require sampling large-batch data. However, in practice, a single-timescale single-loop stochastic algorithm is preferred due to its simplicity and also because its step-size is easier to tune. In this paper, we propose two single-timescale single-loop algorithms which require only one data point each step. Our first algorithm implements momentum updates on both primal and dual variables achieving an $O(\varepsilon^{-4})$ sample complexity, which shows the important role of momentum in obtaining a single-timescale algorithm. Our second algorithm improves upon the first one by applying variance reduction on top of momentum, which matches the best known $O(\varepsilon^{-3})$ sample complexity in existing works. Furthermore, our variance-reduction algorithm does not require a large-batch checkpoint. Moreover, our theoretical results for both algorithms are expressed in a tighter form of simultaneous primal and dual side convergence.
OCJun 22, 2020
Gradient-Variation Bound for Online Convex Optimization with ConstraintsShuang Qiu, Xiaohan Wei, Mladen Kolar
We study online convex optimization with constraints consisting of multiple functional constraints and a relatively simple constraint set, such as a Euclidean ball. As enforcing the constraints at each time step through projections is computationally challenging in general, we allow decisions to violate the functional constraints but aim to achieve a low regret and cumulative violation of the constraints over a horizon of $T$ time steps. First-order methods achieve an $\mathcal{O}(\sqrt{T})$ regret and an $\mathcal{O}(1)$ constraint violation, which is the best-known bound under the Slater's condition, but do not take into account the structural information of the problem. Furthermore, the existing algorithms and analysis are limited to Euclidean space. In this paper, we provide an \emph{instance-dependent} bound for online convex optimization with complex constraints obtained by a novel online primal-dual mirror-prox algorithm. Our instance-dependent regret is quantified by the total gradient variation $V_*(T)$ in the sequence of loss functions. The proposed algorithm works in \emph{general} normed spaces and simultaneously achieves an $\mathcal{O}(\sqrt{V_*(T)})$ regret and an $\mathcal{O}(1)$ constraint violation, which is never worse than the best-known $( \mathcal{O}(\sqrt{T}), \mathcal{O}(1) )$ result and improves over previous works that applied mirror-prox-type algorithms for this problem achieving $\mathcal{O}(T^{2/3})$ regret and constraint violation. Finally, our algorithm is computationally efficient, as it only performs mirror descent steps in each iteration instead of solving a general Lagrangian minimization problem.
LGMar 2, 2020
Upper Confidence Primal-Dual Reinforcement Learning for CMDP with Adversarial LossShuang Qiu, Xiaohan Wei, Zhuoran Yang et al.
We consider online learning for episodic stochastically constrained Markov decision processes (CMDPs), which plays a central role in ensuring the safety of reinforcement learning. Here the loss function can vary arbitrarily across the episodes, and both the loss received and the budget consumption are revealed at the end of each episode. Previous works solve this problem under the restrictive assumption that the transition model of the Markov decision processes (MDPs) is known a priori and establish regret bounds that depend polynomially on the cardinalities of the state space $\mathcal{S}$ and the action space $\mathcal{A}$. In this work, we propose a new \emph{upper confidence primal-dual} algorithm, which only requires the trajectories sampled from the transition model. In particular, we prove that the proposed algorithm achieves $\widetilde{\mathcal{O}}(L|\mathcal{S}|\sqrt{|\mathcal{A}|T})$ upper bounds of both the regret and the constraint violation, where $L$ is the length of each episode. Our analysis incorporates a new high-probability drift analysis of Lagrange multiplier processes into the celebrated regret analysis of upper confidence reinforcement learning, which demonstrates the power of "optimism in the face of uncertainty" in constrained online learning.
LGMar 1, 2020
Provably Efficient Safe Exploration via Primal-Dual Policy OptimizationDongsheng Ding, Xiaohan Wei, Zhuoran Yang et al.
We study the Safe Reinforcement Learning (SRL) problem using the Constrained Markov Decision Process (CMDP) formulation in which an agent aims to maximize the expected total reward subject to a safety constraint on the expected total value of a utility function. We focus on an episodic setting with the function approximation where the Markov transition kernels have a linear structure but do not impose any additional assumptions on the sampling model. Designing SRL algorithms with provable computational and statistical efficiency is particularly challenging under this setting because of the need to incorporate both the safety constraint and the function approximation into the fundamental exploitation/exploration tradeoff. To this end, we present an \underline{O}ptimistic \underline{P}rimal-\underline{D}ual Proximal Policy \underline{OP}timization (OPDOP) algorithm where the value function is estimated by combining the least-squares policy evaluation and an additional bonus term for safe exploration. We prove that the proposed algorithm achieves an $\tilde{O}(d H^{2.5}\sqrt{T})$ regret and an $\tilde{O}(d H^{2.5}\sqrt{T})$ constraint violation, where $d$ is the dimension of the feature mapping, $H$ is the horizon of each episode, and $T$ is the total number of steps. These bounds hold when the reward/utility functions are fixed but the feedback after each episode is bandit. Our bounds depend on the capacity of the state-action space only through the dimension of the feature mapping and thus our results hold even when the number of states goes to infinity. To the best of our knowledge, we provide the first provably efficient online policy optimization algorithm for CMDP with safe exploration in the function approximation setting.
STAug 14, 2019
Robust One-Bit Recovery via ReLU Generative Networks: Near-Optimal Statistical Rate and Global Landscape AnalysisShuang Qiu, Xiaohan Wei, Zhuoran Yang
We study the robust one-bit compressed sensing problem whose goal is to design an algorithm that faithfully recovers any sparse target vector $θ_0\in\mathbb{R}^d$ \textit{uniformly} via $m$ quantized noisy measurements. Specifically, we consider a new framework for this problem where the sparsity is implicitly enforced via mapping a low dimensional representation $x_0 \in \mathbb{R}^k$ through a known $n$-layer ReLU generative network $G:\mathbb{R}^k\rightarrow\mathbb{R}^d$ such that $θ_0 = G(x_0)$. Such a framework poses low-dimensional priors on $θ_0$ without a known sparsity basis. We propose to recover the target $G(x_0)$ solving an unconstrained empirical risk minimization (ERM). Under a weak \textit{sub-exponential measurement assumption}, we establish a joint statistical and computational analysis. In particular, we prove that the ERM estimator in this new framework achieves a statistical rate of $m=\widetilde{\mathcal{O}}(kn \log d /\varepsilon^2)$ recovering any $G(x_0)$ uniformly up to an error $\varepsilon$. When the network is shallow (i.e., $n$ is small), we show this rate matches the information-theoretic lower bound up to logarithm factors of $\varepsilon^{-1}$. From the lens of computation, we prove that under proper conditions on the network weights, our proposed empirical risk, despite non-convexity, has no stationary point outside of small neighborhoods around the true representation $x_0$ and its negative multiple; furthermore, we show that the global minimizer of the empirical risk stays within the neighborhood around $x_0$ rather than its negative multiple under further assumptions on the network weights.
OCAug 7, 2019
Fast Multi-Agent Temporal-Difference Learning via Homotopy Stochastic Primal-Dual OptimizationDongsheng Ding, Xiaohan Wei, Zhuoran Yang et al.
We study the policy evaluation problem in multi-agent reinforcement learning where a group of agents, with jointly observed states and private local actions and rewards, collaborate to learn the value function of a given policy via local computation and communication over a connected undirected network. This problem arises in various large-scale multi-agent systems, including power grids, intelligent transportation systems, wireless sensor networks, and multi-agent robotics. When the dimension of state-action space is large, the temporal-difference learning with linear function approximation is widely used. In this paper, we develop a new distributed temporal-difference learning algorithm and quantify its finite-time performance. Our algorithm combines a distributed stochastic primal-dual method with a homotopy-based approach to adaptively adjust the learning rate in order to minimize the mean-square projected Bellman error by taking fresh online samples from a causal on-policy trajectory. We explicitly take into account the Markovian nature of sampling and improve the best-known finite-time error bound from $O(1/\sqrt{T})$ to~$O(1/T)$, where $T$ is the total number of iterations.
OCAug 12, 2017
Online Convex Optimization with Stochastic ConstraintsHao Yu, Michael J. Neely, Xiaohan Wei
This paper considers online convex optimization (OCO) with stochastic constraints, which generalizes Zinkevich's OCO over a known simple fixed set by introducing multiple stochastic functional constraints that are i.i.d. generated at each round and are disclosed to the decision maker only after the decision is made. This formulation arises naturally when decisions are restricted by stochastic environments or deterministic environments with noisy observations. It also includes many important problems as special cases, such as OCO with long term constraints, stochastic constrained convex optimization, and deterministic constrained convex optimization. To solve this problem, this paper proposes a new algorithm that achieves $O(\sqrt{T})$ expected regret and constraint violations and $O(\sqrt{T}\log(T))$ high probability regret and constraint violations. Experiments on a real-world data center scheduling problem further verify the performance of the new algorithm.