Dabeen Lee

LG
h-index54
21papers
37citations
Novelty62%
AI Score56

21 Papers

OCJan 26, 2023
Online Convex Optimization with Stochastic Constraints: Zero Constraint Violation and Bandit Feedback

Yeongjong Kim, Dabeen Lee

This paper studies online convex optimization with stochastic constraints. We propose a variant of the drift-plus-penalty algorithm that guarantees $O(\sqrt{T})$ expected regret and zero constraint violation, after a fixed number of iterations, which improves the vanilla drift-plus-penalty method with $O(\sqrt{T})$ constraint violation. Our algorithm is oblivious to the length of the time horizon $T$, in contrast to the vanilla drift-plus-penalty method. This is based on our novel drift lemma that provides time-varying bounds on the virtual queue drift and, as a result, leads to time-varying bounds on the expected virtual queue length. Moreover, we extend our framework to stochastic-constrained online convex optimization under two-point bandit feedback. We show that by adapting our algorithmic framework to the bandit feedback setting, we may still achieve $O(\sqrt{T})$ expected regret and zero constraint violation, improving upon the previous work for the case of identical constraint functions. Numerical results demonstrate our theoretical results.

49.0LGMay 12
Primal-Dual Policy Optimization for Linear CMDPs with Adversarial Losses

Kihyun Yu, Seoungbin Bae, Dabeen Lee

Existing work on linear constrained Markov decision processes (CMDPs) has primarily focused on stochastic settings, where the losses and costs are either fixed or drawn from fixed distributions. However, such formulations are inherently vulnerable to adversarially changing environments. To overcome this limitation, we propose a primal-dual policy optimization algorithm for online finite-horizon {adversarial} linear CMDPs, where the losses are adversarially chosen under full-information feedback and the costs are stochastic under bandit feedback. Our algorithm is the \emph{first} to achieve sublinear regret and constraint violation bounds in this setting, both bounded by $\widetilde{\mathcal{O}}(K^{3/4})$, where $K$ denotes the number of episodes. The algorithm introduces and runs with a new class of policies, which we call weighted LogSumExp softmax policies, designed to adapt to adversarially chosen loss functions. Our main result stems from the following key contributions: (i) a new covering number argument for the weighted LogSumExp softmax policies, and (ii) two novel algorithmic components -- periodic policy mixing and a regularized dual update -- which allow us to effectively control both the covering number and the dual variable. We also report numerical results that validate our theoretical findings on the performance of the algorithm.

54.6LGMay 12
Learning Weakly Communicating Average-Reward CMDPs: Strong Duality and Improved Regret

Kihyun Yu, Beomhan Baek, Dabeen Lee

We study infinite-horizon average-reward constrained Markov decision processes (CMDPs) under the weakly communicating assumption. Our contributions are twofold. First, we establish strong duality for weakly communicating average-reward CMDPs over stationary policies with finite state and action spaces. Despite the absence of a linear programming formulation and the resulting nonconvexity under the weakly communicating setting, we show that strong duality still holds by carefully exploiting the geometric structure of the occupation measure set. Second, building on this result, we propose a primal--dual clipped value iteration algorithm for learning weakly communicating average-reward linear CMDPs. Our algorithm achieves regret and constraint violation bounds of $\widetilde{\mathcal{O}}(T^{2/3})$, improving upon the best known bounds, where $T$ denotes the number of interactions. Our approach extends clipped value iteration to the constrained setting and adapts it to a finite-horizon approximation, which stabilizes the dual variable and is crucial for achieving improved regret bounds. To analyze this, we develop a novel approach based on strong duality that enables the decomposition of the composite Lagrangian regret into separate bounds on regret and constraint violation.

35.3LGMay 11
Chebyshev Center-Based Direction Selection for Multi-Objective Optimization and Training PINNs

Hoyeol Yoon, Seoungbin Bae, Nam Ho-Nguyen et al.

Physics-informed neural networks (PINNs) are a promising approach for solving partial differential equations (PDEs). Their training, however, is often difficult because multiple loss terms induced by PDE residuals and boundary or initial conditions must be optimized simultaneously. To address this difficulty, existing approaches often construct update directions by explicitly enforcing particular desirable properties, such as scale robustness and simultaneous descent. While effective in many cases, such property-by-property designs can make it unclear which conditions are essential, what geometric principle determines the selected update direction, and how different methods are structurally related. In this work, we formulate update-direction selection for PINN training as a Chebyshev-center problem in the dual cone. The proposed formulation selects a normalized direction that maximizes the minimum distance to the cone facets. The resulting formulation admits an efficient dual problem in a much lower-dimensional space and yields a convergence guarantee in the nonconvex setting. It also recovers the key desirable properties targeted by existing approaches without imposing them separately; rather, they follow from the single geometric criterion underlying the formulation. This makes the selected direction interpretable through a single geometric rule and provides a unified basis for systematically comparing related direction-selection methods. Experiments on several PINN benchmarks further demonstrate strong empirical performance of the proposed method.

49.0LGApr 24
Logistic Bandits with $\tilde{O}(\sqrt{dT})$ Regret without Context Diversity Assumptions

Seoungbin Bae, Dabeen Lee

We study the $K$-armed logistic bandit problem, where at each round, the agent observes $K$ feature vectors associated with $K$ actions. Existing approaches that achieve a rate-optimal $\tilde{\mathcal{O}}(\sqrt{dT})$ regret bound rely heavily on context diversity assumptions, such as strict positivity of the minimum eigenvalue of a context covariance matrix. These assumptions, however, impose strong restrictions on the context process, as they rule out the situation where the context vectors are concentrated in a low-dimensional subspace. In this paper, we propose SupSplitLog, which, to the best of our knowledge, is the first algorithm for logistic bandits that achieves $\tilde{\mathcal{O}}(\sqrt{dT})$ regret without any context diversity assumption. The key idea is to split the collected samples into two disjoint subsets when constructing estimators; one is used to compute an initial-point estimator, while the other is used to apply a Newton-type one-step correction procedure. The splitting rule is carefully designed to balance the accuracy requirements of the initial-point estimator and the one-step correction procedure. Moreover, SupSplitLog strictly improves on the existing algorithms in terms of the dependence on dimension $d$ in the regret upper bound. Furthermore, SupSplitLog can be adapted simply to deduce a regret bound that grows with a data-dependent complexity measure, avoiding a direct dependence on $d$, which is favorable when the context vectors are concentrated in a low-dimensional subspace. We also provide experimental results that demonstrate numerically the superiority of our algorithm, validating the theoretical results.

48.2LGMar 29
Near-Optimal Primal-Dual Algorithm for Learning Linear Mixture CMDPs with Adversarial Rewards

Kihyun Yu, Seoungbin Bae, Dabeen Lee

We study safe reinforcement learning in finite-horizon linear mixture constrained Markov decision processes (CMDPs) with adversarial rewards under full-information feedback and an unknown transition kernel. We propose a primal-dual policy optimization algorithm that achieves regret and constraint violation bounds of $\widetilde{O}(\sqrt{d^2 H^3 K})$ under mild conditions, where $d$ is the feature dimension, $H$ is the horizon, and $K$ is the number of episodes. To the best of our knowledge, this is the first provably efficient algorithm for linear mixture CMDPs with adversarial rewards. In particular, our regret bound is near-optimal, matching the known minimax lower bound up to logarithmic factors. The key idea is to introduce a regularized dual update that enables a drift-based analysis. This step is essential, as strong duality-based analysis cannot be directly applied when reward functions change across episodes. In addition, we extend weighted ridge regression-based parameter estimation to the constrained setting, allowing us to construct tighter confidence intervals that are crucial for deriving the near-optimal regret bound.

LGSep 16, 2024
Provably Efficient Infinite-Horizon Average-Reward Reinforcement Learning with Linear Function Approximation

Woojin Chae, Dabeen Lee

This paper proposes a computationally tractable algorithm for learning infinite-horizon average-reward linear Markov decision processes (MDPs) and linear mixture MDPs under the Bellman optimality condition. While guaranteeing computational efficiency, our algorithm for linear MDPs achieves the best-known regret upper bound of $\widetilde{\mathcal{O}}(d^{3/2}\mathrm{sp}(v^*)\sqrt{T})$ over $T$ time steps where $\mathrm{sp}(v^*)$ is the span of the optimal bias function $v^*$ and $d$ is the dimension of the feature mapping. For linear mixture MDPs, our algorithm attains a regret bound of $\widetilde{\mathcal{O}}(d\cdot\mathrm{sp}(v^*)\sqrt{T})$. The algorithm applies novel techniques to control the covering number of the value function class and the span of optimistic estimators of the value function, which is of independent interest.

LGJan 27
Queue Length Regret Bounds for Contextual Queueing Bandits

Seoungbin Bae, Garyeong Kang, Dabeen Lee

We introduce contextual queueing bandits, a new context-aware framework for scheduling while simultaneously learning unknown service rates. Individual jobs carry heterogeneous contextual features, based on which the agent chooses a job and matches it with a server to maximize the departure rate. The service/departure rate is governed by a logistic model of the contextual feature with an unknown server-specific parameter. To evaluate the performance of a policy, we consider queue length regret, defined as the difference in queue length between the policy and the optimal policy. The main challenge in the analysis is that the lists of remaining job features in the queue may differ under our policy versus the optimal policy for a given time step, since they may process jobs in different orders. To address this, we propose the idea of policy-switching queues equipped with a sophisticated coupling argument. This leads to a novel queue length regret decomposition framework, allowing us to understand the short-term effect of choosing a suboptimal job-server pair and its long-term effect on queue state differences. We show that our algorithm, CQB-$\varepsilon$, achieves a regret upper bound of $\widetilde{\mathcal{O}}(T^{-1/4})$. We also consider the setting of adversarially chosen contexts, for which our second algorithm, CQB-Opt, achieves a regret upper bound of $\mathcal{O}(\log^2 T)$. Lastly, we provide experimental results that validate our theoretical findings.

LGFeb 2
Learning to Route and Schedule LLMs from User Retrials via Contextual Queueing Bandits

Seoungbin Bae, Junyoung Son, Dabeen Lee

Explosive demands for LLMs often cause user queries to accumulate in server queues, requiring efficient routing (query-LLM matching) and scheduling (query prioritization) mechanisms. Several online algorithms are being deployed, but they overlook the following two key challenges inherent to conversational LLM services: (1) unsatisfied users may retry queries, increasing the server backlog, and (2) requests for ``explicit" feedback, such as ratings, degrade user experiences. In this paper, we develop a joint routing and scheduling algorithm that leverages ``implicit" feedback inferred from user retrial behaviors. The key idea is to propose and study the framework of contextual queueing bandits with multinomial logit feedback (CQB-MNL). CQB-MNL models query retrials, as well as context-based learning for user preferences over LLMs. Our algorithm, anytime CQB (ACQB), achieves efficient learning while maintaining queue stability by combining Thompson sampling with forced exploration at a decaying rate. We show that ACQB simultaneously achieves a cumulative regret of $\widetilde{\mathcal{O}}(\sqrt{t})$ for routing and a queue length regret of $\widetilde{\mathcal{O}}(t^{-1/4})$ for any large $t$. For experiments, we refine query embeddings via contrastive learning while adopting a disjoint parameter model to learn LLM-specific parameters. Experiments on SPROUT, EmbedLLM, and RouterBench datasets confirm that both algorithms consistently outperform baselines.

MLMay 23, 2024
Reinforcement Learning for Infinite-Horizon Average-Reward Linear MDPs via Approximation by Discounted-Reward MDPs

Kihyuk Hong, Woojin Chae, Yufan Zhang et al.

We study the problem of infinite-horizon average-reward reinforcement learning with linear Markov decision processes (MDPs). The associated Bellman operator of the problem not being a contraction makes the algorithm design challenging. Previous approaches either suffer from computational inefficiency or require strong assumptions on dynamics, such as ergodicity, for achieving a regret bound of $\widetilde{O}(\sqrt{T})$. In this paper, we propose the first algorithm that achieves $\widetilde{O}(\sqrt{T})$ regret with computational complexity polynomial in the problem parameters, without making strong assumptions on dynamics. Our approach approximates the average-reward setting by a discounted MDP with a carefully chosen discounting factor, and then applies an optimistic value iteration. We propose an algorithmic structure that plans for a nonstationary policy through optimistic value iteration and follows that policy until a specified information metric in the collected data doubles. Additionally, we introduce a value function clipping procedure for limiting the span of the value function for sample efficiency.

LGOct 19, 2024
Learning Infinite-Horizon Average-Reward Linear Mixture MDPs of Bounded Span

Woojin Chae, Kihyuk Hong, Yufan Zhang et al.

This paper proposes a computationally tractable algorithm for learning infinite-horizon average-reward linear mixture Markov decision processes (MDPs) under the Bellman optimality condition. Our algorithm for linear mixture MDPs achieves a nearly minimax optimal regret upper bound of $\widetilde{\mathcal{O}}(d\sqrt{\mathrm{sp}(v^*)T})$ over $T$ time steps where $\mathrm{sp}(v^*)$ is the span of the optimal bias function $v^*$ and $d$ is the dimension of the feature mapping. Our algorithm applies the recently developed technique of running value iteration on a discounted-reward MDP approximation with clipping by the span. We prove that the value iteration procedure, even with the clipping operation, converges. Moreover, we show that the associated variance term due to random transitions can be bounded even under clipping. Combined with the weighted ridge regression-based parameter estimation scheme, this leads to the nearly minimax optimal regret guarantee.

LGOct 14, 2024
Improved Regret Bound for Safe Reinforcement Learning via Tighter Cost Pessimism and Reward Optimism

Kihyun Yu, Duksang Lee, William Overman et al.

This paper studies the safe reinforcement learning problem formulated as an episodic finite-horizon tabular constrained Markov decision process with an unknown transition kernel and stochastic reward and cost functions. We propose a model-based algorithm based on novel cost and reward function estimators that provide tighter cost pessimism and reward optimism. While guaranteeing no constraint violation in every episode, our algorithm achieves a regret upper bound of $\widetilde{\mathcal{O}}((\bar C - \bar C_b)^{-1}H^{2.5} S\sqrt{AK})$ where $\bar C$ is the cost budget for an episode, $\bar C_b$ is the expected cost under a safe baseline policy over an episode, $H$ is the horizon, and $S$, $A$ and $K$ are the number of states, actions, and episodes, respectively. This improves upon the best-known regret upper bound, and when $\bar C- \bar C_b=Ω(H)$, it nearly matches the regret lower bound of $Ω(H^{1.5}\sqrt{SAK})$. We deduce our cost and reward function estimators via a Bellman-type law of total variance to obtain tight bounds on the expected sum of the variances of value function estimates. This leads to a tighter dependence on the horizon in the function estimators. We also present numerical results to demonstrate the computational effectiveness of our proposed framework.

LGMay 28, 2025
An Optimistic Algorithm for online CMDPS with Anytime Adversarial Constraints

Jiahui Zhu, Kihyun Yu, Dabeen Lee et al.

Online safe reinforcement learning (RL) plays a key role in dynamic environments, with applications in autonomous driving, robotics, and cybersecurity. The objective is to learn optimal policies that maximize rewards while satisfying safety constraints modeled by constrained Markov decision processes (CMDPs). Existing methods achieve sublinear regret under stochastic constraints but often fail in adversarial settings, where constraints are unknown, time-varying, and potentially adversarially designed. In this paper, we propose the Optimistic Mirror Descent Primal-Dual (OMDPD) algorithm, the first to address online CMDPs with anytime adversarial constraints. OMDPD achieves optimal regret O(sqrt(K)) and strong constraint violation O(sqrt(K)) without relying on Slater's condition or the existence of a strictly known safe policy. We further show that access to accurate estimates of rewards and transitions can further improve these bounds. Our results offer practical guarantees for safe decision-making in adversarial environments.

LGMay 4, 2025
Neural Logistic Bandits

Seoungbin Bae, Dabeen Lee

We study the problem of neural logistic bandits, where the main task is to learn an unknown reward function within a logistic link function using a neural network. Existing approaches either exhibit unfavorable dependencies on $κ$, where $1/κ$ represents the minimum variance of reward distributions, or suffer from direct dependence on the feature dimension $d$, which can be huge in neural network-based settings. In this work, we introduce a novel Bernstein-type inequality for self-normalized vector-valued martingales that is designed to bypass a direct dependence on the ambient dimension. This lets us deduce a regret upper bound that grows with the effective dimension $\widetilde{d}$, not the feature dimension, while keeping a minimal dependence on $κ$. Based on the concentration inequality, we propose two algorithms, NeuralLog-UCB-1 and NeuralLog-UCB-2, that guarantee regret upper bounds of order $\widetilde{O}(\widetilde{d}\sqrt{κT})$ and $\widetilde{O}(\widetilde{d}\sqrt{T/κ})$, respectively, improving on the existing results. Lastly, we report numerical results on both synthetic and real datasets to validate our theoretical findings.

LGJun 19, 2024
Infinite-Horizon Reinforcement Learning with Multinomial Logistic Function Approximation

Jaehyun Park, Junyeop Kwon, Dabeen Lee

We study model-based reinforcement learning with non-linear function approximation where the transition function of the underlying Markov decision process (MDP) is given by a multinomial logistic (MNL) model. We develop a provably efficient discounted value iteration-based algorithm that works for both infinite-horizon average-reward and discounted-reward settings. For average-reward communicating MDPs, the algorithm guarantees a regret upper bound of $\tilde{\mathcal{O}}(dD\sqrt{T})$ where $d$ is the dimension of feature mapping, $D$ is the diameter of the underlying MDP, and $T$ is the horizon. For discounted-reward MDPs, our algorithm achieves $\tilde{\mathcal{O}}(d(1-γ)^{-2}\sqrt{T})$ regret where $γ$ is the discount factor. Then we complement these upper bounds by providing several regret lower bounds. We prove a lower bound of $Ω(d\sqrt{DT})$ for learning communicating MDPs of diameter $D$ and a lower bound of $Ω(d(1-γ)^{3/2}\sqrt{T})$ for learning discounted-reward MDPs with discount factor $γ$. Lastly, we show a regret lower bound of $Ω(dH^{3/2}\sqrt{K})$ for learning $H$-horizon episodic MDPs with MNL function approximation where $K$ is the number of episodes, which improves upon the best-known lower bound for the finite-horizon setting.

LGFeb 23, 2024
Parameter-Free Algorithms for Performative Regret Minimization under Decision-Dependent Distributions

Sungwoo Park, Junyeop Kwon, Byeongnoh Kim et al.

This paper studies performative risk minimization, a formulation of stochastic optimization under decision-dependent distributions. We consider the general case where the performative risk can be non-convex, for which we develop efficient parameter-free optimistic optimization-based methods. Our algorithms significantly improve upon the existing Lipschitz bandit-based method in many aspects. In particular, our framework does not require knowledge about the sensitivity parameter of the distribution map and the Lipshitz constant of the loss function. This makes our framework practically favorable, together with the efficient optimistic optimization-based tree-search mechanism. We provide experimental results that demonstrate the numerical superiority of our algorithms over the existing method and other black-box optimistic optimization methods.

OCDec 7, 2023
Stochastic-Constrained Stochastic Optimization with Markovian Data

Yeongjong Kim, Dabeen Lee

This paper considers stochastic-constrained stochastic optimization where the stochastic constraint is to satisfy that the expectation of a random function is below a certain threshold. In particular, we study the setting where data samples are drawn from a Markov chain and thus are not independent and identically distributed. We generalize the drift-plus-penalty framework, a primal-dual stochastic gradient method developed for the i.i.d. case, to the Markov chain sampling setting. We propose two variants of drift-plus-penalty; one is for the case when the mixing time of the underlying Markov chain is known while the other is for the case of unknown mixing time. In fact, our algorithms apply to a more general setting of constrained online convex optimization where the sequence of constraint functions follows a Markov chain. Both algorithms are adaptive in that the first works without knowledge of the time horizon while the second uses AdaGrad-style algorithm parameters, which is of independent interest. We demonstrate the effectiveness of our proposed methods through numerical experiments on classification with fairness constraints.

DSMay 18, 2023
Online Resource Allocation in Episodic Markov Decision Processes

Duksang Lee, William Overman, Dabeen Lee

This paper studies a long-term resource allocation problem over multiple periods where each period requires a multi-stage decision-making process. We formulate the problem as an online allocation problem in an episodic finite-horizon constrained Markov decision process with an unknown non-stationary transition function and stochastic non-stationary reward and resource consumption functions. We propose the observe-then-decide regime and improve the existing decide-then-observe regime, while the two settings differ in how the observations and feedback about the reward and resource consumption functions are given to the decision-maker. We develop an online dual mirror descent algorithm that achieves near-optimal regret bounds for both settings. For the observe-then-decide regime, we prove that the expected regret against the dynamic clairvoyant optimal policy is bounded by $\tilde O(ρ^{-1}{H^{3/2}}S\sqrt{AT})$ where $ρ\in(0,1)$ is the budget parameter, $H$ is the length of the horizon, $S$ and $A$ are the numbers of states and actions, and $T$ is the number of episodes. For the decide-then-observe regime, we show that the regret against the static optimal policy that has access to the mean reward and mean resource consumption functions is bounded by $\tilde O(ρ^{-1}{H^{3/2}}S\sqrt{AT})$ with high probability. We test the numerical efficiency of our method for a variant of the resource-constrained inventory management problem.

OCMay 2, 2023
Projection-Free Online Convex Optimization with Stochastic Constraints

Duksang Lee, Nam Ho-Nguyen, Dabeen Lee

This paper develops projection-free algorithms for online convex optimization with stochastic constraints. We design an online primal-dual projection-free framework that can take any projection-free algorithms developed for online convex optimization with no long-term constraint. With this general template, we deduce sublinear regret and constraint violation bounds for various settings. Moreover, for the case where the loss and constraint functions are smooth, we develop a primal-dual conditional gradient method that achieves $O(\sqrt{T})$ regret and $O(T^{3/4})$ constraint violations. Furthermore, for the setting where the loss and constraint functions are stochastic and strong duality holds for the associated offline stochastic optimization problem, we prove that the constraint violation can be reduced to have the same asymptotic growth as the regret.

LGMay 28, 2021
Scheduling Jobs with Stochastic Holding Costs

Dabeen Lee, Milan Vojnovic

We study a single-server scheduling problem for the objective of minimizing the expected cumulative holding cost incurred by jobs, where parameters defining stochastic job holding costs are unknown to the scheduler. We consider a general setting allowing for different job classes, where jobs of the same class have statistically identical holding costs and service times, with an arbitrary number of jobs across classes. In each time step, the server can process a job and observes random holding costs of the jobs that are yet to be completed. We consider a learning-based $cμ$ rule scheduling which starts with a preemption period of fixed duration, serving as a learning phase, and having gathered data about jobs, it switches to nonpreemptive scheduling. Our algorithms are designed to handle instances with large and small gaps in mean job holding costs and achieve near-optimal performance guarantees. The performance of algorithms is evaluated by regret, where the benchmark is the minimum possible total holding cost attained by the $cμ$ rule scheduling policy when the parameters of jobs are known. We show regret lower bounds and algorithms that achieve nearly matching regret upper bounds. Our numerical results demonstrate the efficacy of our algorithms and show that our regret analysis is nearly tight.

DSDec 30, 2020
Test Score Algorithms for Budgeted Stochastic Utility Maximization

Dabeen Lee, Milan Vojnovic, Se-Young Yun

Motivated by recent developments in designing algorithms based on individual item scores for solving utility maximization problems, we study the framework of using test scores, defined as a statistic of observed individual item performance data, for solving the budgeted stochastic utility maximization problem. We extend an existing scoring mechanism, namely the replication test scores, to incorporate heterogeneous item costs as well as item values. We show that a natural greedy algorithm that selects items solely based on their replication test scores outputs solutions within a constant factor of the optimum for a broad class of utility functions. Our algorithms and approximation guarantees assume that test scores are noisy estimates of certain expected values with respect to marginal distributions of individual item values, thus making our algorithms practical and extending previous work that assumes noiseless estimates. Moreover, we show how our algorithm can be adapted to the setting where items arrive in a streaming fashion while maintaining the same approximation guarantee. We present numerical results, using synthetic data and data sets from the Academia.StackExchange Q&A forum, which show that our test score algorithm can achieve competitiveness, and in some cases better performance than a benchmark algorithm that requires access to a value oracle to evaluate function values.