LGMay 20
Improved Guarantees for Constrained Online Convex Optimization via Self-ContractionDhruv Sarkar, Abhishek Sinha
We consider Constrained Online Convex Optimization (COCO) with adversarially chosen constraints. At each round, the learner chooses an action before observing the loss and constraint function for that round. The goal is to achieve small static regret against the best point satisfying all constraints while also controlling cumulative constraint violation ($\mathsf{CCV}$). For strongly convex losses, state-of-the-art algorithms achieve $O(\log T)$ regret and $O(\sqrt{T \log T})$ $\mathsf{CCV}.$ The corresponding best-known bounds for convex losses is $O(\sqrt{T})$ regret and $O(\sqrt{T} \log T)$ $\mathsf{CCV}$. In this paper, we give a simple projection-based algorithm that simultaneously achieves $O(\log T)$ regret and $O(\log T)$ $\mathsf{CCV}$ for strongly-convex losses, yielding an exponential improvement in the $\mathsf{CCV}$. For the convex losses, our algorithm improves the $\mathsf{CCV}$ to $O(\sqrt{T})$ while maintaining the optimal $O(\sqrt{T})$ regret. The key to our improvement is a recent geometric result for self-contracted curves, which may be of independent interest.
LGJan 30
Improved Algorithms for Nash Welfare in Linear BanditsDhruv Sarkar, Nishant Pandey, Sayak Ray Chowdhury
Nash regret has recently emerged as a principled fairness-aware performance metric for stochastic multi-armed bandits, motivated by the Nash Social Welfare objective. Although this notion has been extended to linear bandits, existing results suffer from suboptimality in ambient dimension $d$, stemming from proof techniques that rely on restrictive concentration inequalities. In this work, we resolve this open problem by introducing new analytical tools that yield an order-optimal Nash regret bound in linear bandits. Beyond Nash regret, we initiate the study of $p$-means regret in linear bandits, a unifying framework that interpolates between fairness and utility objectives and strictly generalizes Nash regret. We propose a generic algorithmic framework, FairLinBandit, that works as a meta-algorithm on top of any linear bandit strategy. We instantiate this framework using two bandit algorithms: Phased Elimination and Upper Confidence Bound, and prove that both achieve sublinear $p$-means regret for the entire range of $p$. Extensive experiments on linear bandit instances generated from real-world datasets demonstrate that our methods consistently outperform the existing state-of-the-art baseline.
LGMay 7
Constrained Contextual Bandits with Adversarial ContextsDhruv Sarkar, Abhishek Sinha
We study budget-constrained contextual bandits with adversarial contexts, where each action yields a random reward and incurs a random cost. We adopt the standard realizability assumption: conditioned on the observed context, rewards and costs are drawn independently from fixed distributions whose expectations belong to known function classes. We focus on the continuing setting, in which the algorithm operates over the entire horizon even after the budget for cumulative cost is exhausted. In this setting, the objective is to simultaneously control regret and the violation of the budget constraint. Building on the seminal $\mathsf{SquareCB}$ framework of Foster et al. [2018], we propose a simple and modular framework that leverages online regression oracles to reduce the constrained problem to a standard unconstrained contextual bandit problem with adaptively defined surrogate reward functions. In contrast to prior works, which focus on stochastic contexts, our reduction yields improved guarantees for more general adversarial contexts, together with an efficient algorithm with a compact and transparent analysis.
LGJan 28, 2025
Projection-free Algorithms for Online Convex Optimization with Adversarial ConstraintsDhruv Sarkar, Aprameyo Chakrabartty, Subhamon Supantha et al.
We study a generalization of the Online Convex Optimization (OCO) framework with time-varying adversarial constraints. In this problem, after selecting a feasible action from the convex decision set $X,$ a convex constraint function is revealed alongside the cost function in each round. Our goal is to design a computationally efficient learning policy that achieves a small regret with respect to the cost functions and a small cumulative constraint violation (CCV) with respect to the constraint functions over a horizon of length $T$. It is well-known that the projection step constitutes the major computational bottleneck of the standard OCO algorithms. However, for many structured decision sets, linear functions can be efficiently optimized over the decision set. We propose a *projection-free* online policy which makes a single call to a Linear Program (LP) solver per round. Our method outperforms state-of-the-art projection-free online algorithms with adversarial constraints, achieving improved bounds of $\tilde{O}(T^{\frac{3}{4}})$ for both regret and CCV. The proposed algorithm is conceptually simple - it first constructs a surrogate cost function as a non-negative linear combination of the cost and constraint functions. Then, it passes the surrogate costs to a new, adaptive version of the online conditional gradient subroutine, which we propose in this paper.
LGAug 23, 2025
Online Learning for Approximately-Convex Functions with Long-term Adversarial ConstraintsDhruv Sarkar, Samrat Mukhopadhyay, Abhishek Sinha
We study an online learning problem with long-term budget constraints in the adversarial setting. In this problem, at each round $t$, the learner selects an action from a convex decision set, after which the adversary reveals a cost function $f_t$ and a resource consumption function $g_t$. The cost and consumption functions are assumed to be $α$-approximately convex - a broad class that generalizes convexity and encompasses many common non-convex optimization problems, including DR-submodular maximization, Online Vertex Cover, and Regularized Phase Retrieval. The goal is to design an online algorithm that minimizes cumulative cost over a horizon of length $T$ while approximately satisfying a long-term budget constraint of $B_T$. We propose an efficient first-order online algorithm that guarantees $O(\sqrt{T})$ $α$-regret against the optimal fixed feasible benchmark while consuming at most $O(B_T \log T)+ \tilde{O}(\sqrt{T})$ resources in both full-information and bandit feedback settings. In the bandit feedback setting, our approach yields an efficient solution for the $\texttt{Adversarial Bandits with Knapsacks}$ problem with improved guarantees. We also prove matching lower bounds, demonstrating the tightness of our results. Finally, we characterize the class of $α$-approximately convex functions and show that our results apply to a broad family of problems.
LGAug 5, 2025
DP-NCB: Privacy Preserving Fair BanditsDhruv Sarkar, Nishant Pandey, Sayak Ray Chowdhury
Multi-armed bandit algorithms are fundamental tools for sequential decision-making under uncertainty, with widespread applications across domains such as clinical trials and personalized decision-making. As bandit algorithms are increasingly deployed in these socially sensitive settings, it becomes critical to protect user data privacy and ensure fair treatment across decision rounds. While prior work has independently addressed privacy and fairness in bandit settings, the question of whether both objectives can be achieved simultaneously has remained largely open. Existing privacy-preserving bandit algorithms typically optimize average regret, a utilitarian measure, whereas fairness-aware approaches focus on minimizing Nash regret, which penalizes inequitable reward distributions, but often disregard privacy concerns. To bridge this gap, we introduce Differentially Private Nash Confidence Bound (DP-NCB)-a novel and unified algorithmic framework that simultaneously ensures $ε$-differential privacy and achieves order-optimal Nash regret, matching known lower bounds up to logarithmic factors. The framework is sufficiently general to operate under both global and local differential privacy models, and is anytime, requiring no prior knowledge of the time horizon. We support our theoretical guarantees with simulations on synthetic bandit instances, showing that DP-NCB incurs substantially lower Nash regret than state-of-the-art baselines. Our results offer a principled foundation for designing bandit algorithms that are both privacy-preserving and fair, making them suitable for high-stakes, socially impactful applications.
LGFeb 4
A Simple Reduction Scheme for Constrained Contextual Bandits with Adversarial Contexts via RegressionDhruv Sarkar, Abhishek Sinha
We study constrained contextual bandits (CCB) with adversarially chosen contexts, where each action yields a random reward and incurs a random cost. We adopt the standard realizability assumption: conditioned on the observed context, rewards and costs are drawn independently from fixed distributions whose expectations belong to known function classes. We consider the continuing setting, in which the algorithm operates over the entire horizon even after the budget is exhausted. In this setting, the objective is to simultaneously control regret and cumulative constraint violation. Building on the seminal SquareCB framework of Foster et al. (2018), we propose a simple and modular algorithmic scheme that leverages online regression oracles to reduce the constrained problem to a standard unconstrained contextual bandit problem with adaptively defined surrogate reward functions. In contrast to most prior work on CCB, which focuses on stochastic contexts, our reduction yields improved guarantees for the more general adversarial context setting, together with a compact and transparent analysis.
LGOct 26, 2025
Optimal Anytime Algorithms for Online Convex Optimization with Adversarial ConstraintsDhruv Sarkar, Abhishek Sinha
We propose an anytime online algorithm for the problem of learning a sequence of adversarial convex cost functions while approximately satisfying another sequence of adversarial online convex constraints. A sequential algorithm is called \emph{anytime} if it provides a non-trivial performance guarantee for any intermediate timestep $t$ without requiring prior knowledge of the length of the entire time horizon $T$. Our proposed algorithm achieves optimal performance bounds without resorting to the standard doubling trick, which has poor practical performance due to multiple restarts. Our core technical contribution is the use of time-varying Lyapunov functions to keep track of constraint violations. This must be contrasted with prior works that used a fixed Lyapunov function tuned to the known horizon length $T$. The use of time-varying Lyapunov function poses unique analytical challenges as properties, such as \emph{monotonicity}, on which the prior proofs rest, no longer hold. By introducing a new analytical technique, we show that our algorithm achieves $O(\sqrt{t})$ regret and $\tilde{O}(\sqrt{t})$ cumulative constraint violation bounds for any $t\geq 1$. We extend our results to the dynamic regret setting, achieving bounds that adapt to the path length of the comparator sequence without prior knowledge of its total length. We also present an adaptive algorithm in the optimistic setting, whose performance gracefully scales with the cumulative prediction error. We demonstrate the practical utility of our algorithm through numerical experiments involving the online shortest path problem.
LGOct 24, 2025
Revisiting Social Welfare in Bandits: UCB is (Nearly) All You NeedDhruv Sarkar, Nishant Pandey, Sayak Ray Chowdhury
Regret in stochastic multi-armed bandits traditionally measures the difference between the highest reward and either the arithmetic mean of accumulated rewards or the final reward. These conventional metrics often fail to address fairness among agents receiving rewards, particularly in settings where rewards are distributed across a population, such as patients in clinical trials. To address this, a recent body of work has introduced Nash regret, which evaluates performance via the geometric mean of accumulated rewards, aligning with the Nash social welfare function known for satisfying fairness axioms. To minimize Nash regret, existing approaches require specialized algorithm designs and strong assumptions, such as multiplicative concentration inequalities and bounded, non-negative rewards, making them unsuitable for even Gaussian reward distributions. We demonstrate that an initial uniform exploration phase followed by a standard Upper Confidence Bound (UCB) algorithm achieves near-optimal Nash regret, while relying only on additive Hoeffding bounds, and naturally extending to sub-Gaussian rewards. Furthermore, we generalize the algorithm to a broad class of fairness metrics called the $p$-mean regret, proving (nearly) optimal regret bounds uniformly across all $p$ values. This is in contrast to prior work, which made extremely restrictive assumptions on the bandit instances and even then achieved suboptimal regret bounds.
CVJul 26, 2025
TAPS : Frustratingly Simple Test Time Active Learning for VLMsDhruv Sarkar, Aprameyo Chakrabartty, Bibhudatta Bhanja
Test-Time Optimization enables models to adapt to new data during inference by updating parameters on-the-fly. Recent advances in Vision-Language Models (VLMs) have explored learning prompts at test time to improve performance in downstream tasks. In this work, we extend this idea by addressing a more general and practical challenge: Can we effectively utilize an oracle in a continuous data stream where only one sample is available at a time, requiring an immediate query decision while respecting latency and memory constraints? To tackle this, we propose a novel Test-Time Active Learning (TTAL) framework that adaptively queries uncertain samples and updates prompts dynamically. Unlike prior methods that assume batched data or multiple gradient updates, our approach operates in a real-time streaming scenario with a single test sample per step. We introduce a dynamically adjusted entropy threshold for active querying, a class-balanced replacement strategy for memory efficiency, and a class-aware distribution alignment technique to enhance adaptation. The design choices are justified using careful theoretical analysis. Extensive experiments across 10 cross-dataset transfer benchmarks and 4 domain generalization datasets demonstrate consistent improvements over state-of-the-art methods while maintaining reasonable latency and memory overhead. Our framework provides a practical and effective solution for real-world deployment in safety-critical applications such as autonomous systems and medical diagnostics.
MLJul 17, 2025
Relation-Aware Slicing in Cross-Domain AlignmentDhruv Sarkar, Aprameyo Chakrabartty, Anish Chakrabarty et al.
The Sliced Gromov-Wasserstein (SGW) distance, aiming to relieve the computational cost of solving a non-convex quadratic program that is the Gromov-Wasserstein distance, utilizes projecting directions sampled uniformly from unit hyperspheres. This slicing mechanism incurs unnecessary computational costs due to uninformative directions, which also affects the representative power of the distance. However, finding a more appropriate distribution over the projecting directions (slicing distribution) is often an optimization problem in itself that comes with its own computational cost. In addition, with more intricate distributions, the sampling itself may be expensive. As a remedy, we propose an optimization-free slicing distribution that provides fast sampling for the Monte Carlo approximation. We do so by introducing the Relation-Aware Projecting Direction (RAPD), effectively capturing the pairwise association of each of two pairs of random vectors, each following their ambient law. This enables us to derive the Relation-Aware Slicing Distribution (RASD), a location-scale law corresponding to sampled RAPDs. Finally, we introduce the RASGW distance and its variants, e.g., IWRASGW (Importance Weighted RASGW), which overcome the shortcomings experienced by SGW. We theoretically analyze its properties and substantiate its empirical prowess using extensive experiments on various alignment tasks.