Arnab Maiti

LG
h-index: 9
14 papers
78 citations
Novelty: 61%
AI Score: 57

14 Papers

LG · May 27, 2022
Fairness and Welfare Quantification for Regret in Multi-Armed Bandits

Siddharth Barman, Arindam Khan, Arnab Maiti et al. · stanford

We extend the notion of regret with a welfarist perspective. Focussing on the classic multi-armed bandit (MAB) framework, the current work quantifies the performance of bandit algorithms by applying a fundamental welfare function, namely the Nash social welfare (NSW) function. This corresponds to equating an algorithm's performance to the geometric mean of its expected rewards and leads us to the study of Nash regret, defined as the difference between the -- a priori unknown -- optimal mean (among the arms) and the algorithm's performance. Since NSW is known to satisfy fairness axioms, our approach complements the utilitarian considerations of average (cumulative) regret, wherein the algorithm is evaluated via the arithmetic mean of its expected rewards. This work develops an algorithm that, given the horizon of play $T$, achieves a Nash regret of $O \left( \sqrt{\frac{k \log T}{T}} \right)$, where $k$ denotes the number of arms in the MAB instance. Since, for any algorithm, the Nash regret is at least as large as its average regret (by the AM-GM inequality), the known lower bound on average regret holds for Nash regret as well. Therefore, our Nash regret guarantee is essentially tight. In addition, we develop an anytime algorithm with a Nash regret guarantee of $O \left( \sqrt{\frac{k\log T}{T}} \log T \right)$.
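
The two regret notions in the abstract above can be compared numerically. The following is a minimal sketch, not the paper's algorithm: the function name `nash_and_average_regret` and the example reward sequence are illustrative assumptions, and rewards are assumed to be non-negative.

```python
import math


def nash_and_average_regret(expected_rewards, best_mean):
    """Return (nash_regret, average_regret) for per-round expected rewards.

    Assumption: rewards are non-negative. Nash regret uses the geometric
    mean of the expected rewards; average regret uses the arithmetic mean.
    """
    T = len(expected_rewards)
    if T == 0:
        raise ValueError("need at least one round")
    arithmetic_mean = sum(expected_rewards) / T
    if min(expected_rewards) <= 0:
        # A single zero-reward round drives the geometric mean to zero.
        geometric_mean = 0.0
    else:
        # Compute the geometric mean in log space for numerical stability.
        geometric_mean = math.exp(sum(math.log(r) for r in expected_rewards) / T)
    return best_mean - geometric_mean, best_mean - arithmetic_mean


# Hypothetical run: the optimal arm has mean 0.9, and one round pulls a 0.4 arm.
nash, avg = nash_and_average_regret([0.9, 0.4, 0.9, 0.9], best_mean=0.9)
```

By the AM-GM inequality the first returned value is never smaller than the second, which is why a lower bound on average regret carries over to Nash regret.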

GT · Mar 19, 2023
Instance-dependent Sample Complexity Bounds for Zero-sum Matrix Games

Arnab Maiti, Kevin Jamieson, Lillian J. Ratliff

We study the sample complexity of identifying an approximate equilibrium for two-player zero-sum $n\times 2$ matrix games. That is, in a sequence of repeated game plays, how many rounds must the two players play before reaching an approximate equilibrium (e.g., Nash)? We derive instance-dependent bounds that define an ordering over game matrices that captures the intuition that the dynamics of some games converge faster than others. Specifically, we consider a stochastic observation model such that when the two players choose actions $i$ and $j$, respectively, they both observe each other's played actions and a stochastic observation $X_{ij}$ such that $\mathbb E[ X_{ij}] = A_{ij}$. To our knowledge, our work is the first case of instance-dependent lower bounds on the number of rounds the players must play before reaching an approximate equilibrium in the sense that the number of rounds depends on the specific properties of the game matrix $A$ as well as the desired accuracy. We also prove a converse statement: there exist player strategies that achieve this lower bound.

LG · Jun 22, 2023
On the Limitations and Possibilities of Nash Regret Minimization in Zero-Sum Matrix Games under Noisy Feedback

Arnab Maiti, Kevin Jamieson, Lillian J. Ratliff

This paper studies a variant of two-player zero-sum matrix games, where, at each timestep, the row player selects row $i$, the column player selects column $j$, and the row player receives a noisy reward with expected value $A_{i,j}$, along with noisy feedback on the input matrix $A$. The row player's goal is to maximize their total reward against an adversarial column player. Nash regret, defined as the difference between the player's total reward and the game's Nash equilibrium value scaled by the time horizon $T$, is often used to evaluate algorithmic performance in zero-sum games. We begin by studying the limitations of existing algorithms for minimizing Nash regret. We show that standard algorithms, including Hedge, FTRL, and OMD, as well as the strategy of playing the Nash equilibrium of the empirical matrix, all incur $Ω(\sqrt{T})$ Nash regret, even when the row player receives noisy feedback on the entire matrix $A$. Furthermore, we show that UCB for matrix games, a natural adaptation of the well-known bandit algorithm, also suffers $Ω(\sqrt{T})$ Nash regret under bandit feedback. Notably, these lower bounds hold even in the simplest case of $2 \times 2$ matrix games, where the instance-dependent matrix parameters are constant. We next ask whether instance-dependent $\text{polylog}(T)$ Nash regret is achievable against adversarial opponents. We answer this affirmatively. In the full-information setting, we present the first algorithm for general $n \times m$ matrix games that achieves instance-dependent $\text{polylog}(T)$ Nash regret. In the bandit feedback setting, we design an algorithm with similar guarantees for the special case of $2 \times 2$ games, the same regime in which existing algorithms provably suffer $Ω(\sqrt{T})$ regret despite the simplicity of the instance. Finally, we validate our theoretical results with empirical evidence.

LG · Oct 25, 2023
Near-Optimal Pure Exploration in Matrix Games: A Generalization of Stochastic Bandits & Dueling Bandits

Arnab Maiti, Ross Boczar, Kevin Jamieson et al.

We study the sample complexity of identifying the pure strategy Nash equilibrium (PSNE) in a two-player zero-sum matrix game with noise. Formally, we are given a stochastic model where any learner can sample an entry $(i,j)$ of the input matrix $A\in[-1,1]^{n\times m}$ and observe $A_{i,j}+η$ where $η$ is a zero-mean 1-sub-Gaussian noise. The aim of the learner is to identify the PSNE of $A$, whenever it exists, with high probability while taking as few samples as possible. Zhou et al. (2017) present an instance-dependent sample complexity lower bound that depends only on the entries in the row and column in which the PSNE lies. We design a near-optimal algorithm whose sample complexity matches the lower bound, up to log factors. The problem of identifying the PSNE also generalizes the problem of pure exploration in stochastic multi-armed bandits and dueling bandits, and our result matches the optimal bounds, up to log factors, in both settings.
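
To make the target of the learner concrete, here is a minimal sketch of what a PSNE is in the noiseless case. It is not the paper's sampling algorithm. It assumes the row player maximizes and the column player minimizes (the abstract does not state the convention), and the name `find_psne` is hypothetical.

```python
def find_psne(A):
    """Return all pure strategy Nash equilibria (i, j) of a known matrix A.

    Assumed convention: the row player maximizes A[i][j] and the column
    player minimizes it. Then (i, j) is a PSNE exactly when A[i][j] is the
    minimum of row i and the maximum of column j (a saddle point).
    """
    n, m = len(A), len(A[0])
    row_mins = [min(row) for row in A]
    col_maxs = [max(A[i][j] for i in range(n)) for j in range(m)]
    return [
        (i, j)
        for i in range(n)
        for j in range(m)
        if A[i][j] == row_mins[i] and A[i][j] == col_maxs[j]
    ]


# Hypothetical instances with entries in [-1, 1].
with_saddle = [[0.2, 0.5], [0.1, -0.3]]
matching_pennies = [[1, -1], [-1, 1]]
```

In the noisy setting of the abstract the entries are only observed through samples, so the comparisons above would have to be replaced by confidence-interval tests; that part is not shown here.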

LG · Feb 6
Adversarial Learning in Games with Bandit Feedback: Logarithmic Pure-Strategy Maximin Regret

Shinji Ito, Haipeng Luo, Arnab Maiti et al.

Learning to play zero-sum games is a fundamental problem in game theory and machine learning. While significant progress has been made in minimizing external regret in the self-play settings or with full-information feedback, real-world applications often force learners to play against unknown, arbitrary opponents and restrict learners to bandit feedback where only the payoff of the realized action is observable. In such challenging settings, it is well-known that $Ω(\sqrt{T})$ external regret is unavoidable (where $T$ is the number of rounds). To overcome this barrier, we investigate adversarial learning in zero-sum games under bandit feedback, aiming to minimize the deficit against the maximin pure strategy -- a metric we term Pure-Strategy Maximin Regret. We analyze this problem under two bandit feedback models: uninformed (only the realized reward is revealed) and informed (both the reward and the opponent's action are revealed). For uninformed bandit learning of normal-form games, we show that the Tsallis-INF algorithm achieves $O(c \log T)$ instance-dependent regret with a game-dependent parameter $c$. Crucially, we prove an information-theoretic lower bound showing that the dependence on $c$ is necessary. To overcome this hardness, we turn to the informed setting and introduce Maximin-UCB, which obtains another regret bound of the form $O(c' \log T)$ for a different game-dependent parameter $c'$ that could potentially be much smaller than $c$. Finally, we generalize both results to bilinear games over an arbitrary, large action set, proposing Tsallis-FTRL-SPM and Maximin-LinUCB for the uninformed and informed settings, respectively, and establishing similar game-dependent logarithmic regret bounds.

87.0 · LG · May 15
On the Power of Adaptivity for $\varepsilon$-Best Arm Identification in Linear Bandits

Arnab Maiti, Yunbei Xu, Kevin Jamieson

We study the minimax sample complexity of $\varepsilon$-best arm identification in linear bandits. Given a compact action set $\mathcal{X}$ that spans $\mathbb{R}^d$ and an unknown reward vector $θ\in\mathbb{R}^d$, the goal is to output an arm $\widehat{x}\in\mathcal{X}$ such that $\langle \widehat{x},θ\rangle \ge \max_{x\in\mathcal{X}} \langle x,θ\rangle - \varepsilon$ with probability at least $1-δ$, using as few samples as possible. First, we present a non-adaptive fixed-design method with sample complexity $\mathcal{O}\!\left(\frac{d\log(1/δ)}{\varepsilon^2}+\frac{w(\mathcal{X})^2}{\varepsilon^2}\right)$, where $w(\mathcal{X})$ is a Gaussian width term dependent on $\mathcal{X}$, and we prove a matching lower bound $Ω\!\left(\frac{d\log(1/δ)}{\varepsilon^2}+\frac{w(\mathcal{X})^2}{\varepsilon^2}\right)$ for all non-adaptive fixed-design methods. We then turn to adaptive sampling. We raise an important structural question: beyond the canonical basis, are there structured action sets for which adaptivity yields only logarithmic-factor improvements over the optimal non-adaptive rate? We answer in the affirmative for several natural action sets, namely the hypercube, the $\ell_2$ ball, $m$-sets, and multi-task multi-armed bandits. Finally, we provide the first construction of an action set $\mathcal{X}$ for which adaptivity yields a polynomial-factor improvement over every non-adaptive algorithm. A key ingredient behind this separation is an $\ell_2$-norm estimation subroutine: we design an adaptive algorithm that uses $\mathcal{O}\!\left(\frac{d\log(1/δ)}{\varepsilon^2}\right)$ samples from the unit $\ell_2$ ball in $\mathbb{R}^d$ and outputs an estimate $\widehat r$ satisfying $|\widehat r-\|θ\|_2|\le \varepsilon$ with probability at least $1-δ$, where $θ$ is the unknown reward vector.

ML · Feb 24
Efficient Uncoupled Learning Dynamics with $\tilde{O}\!\left(T^{-1/4}\right)$ Last-Iterate Convergence in Bilinear Saddle-Point Problems over Convex Sets under Bandit Feedback

Arnab Maiti, Claire Jie Zhang, Kevin Jamieson et al.

In this paper, we study last-iterate convergence of learning algorithms in bilinear saddle-point problems, a preferable notion of convergence that captures the day-to-day behavior of learning dynamics. We focus on the challenging setting where players select actions from compact convex sets and receive only bandit feedback. Our main contribution is the design of an uncoupled learning algorithm that guarantees last-iterate convergence to the Nash equilibrium with high probability. We establish a convergence rate of $\tilde{O}(T^{-1/4})$ up to polynomial factors in problem parameters. Crucially, our proposed algorithm is computationally efficient, requiring only an efficient linear optimization oracle over the players' compact action sets. The algorithm is obtained by combining techniques from experimental design and the classic Follow-The-Regularized-Leader (FTRL) framework, with a carefully chosen regularizer function tailored to the geometry of the action set of each learner.

GT · Feb 25
Revisiting the Bertrand Paradox via Equilibrium Analysis of No-regret Learners

Arnab Maiti, Junyan Liu, Kevin Jamieson et al.

We study the discrete Bertrand pricing game with a non-increasing demand function. The game has $n \ge 2$ players who simultaneously choose prices from the set $\{1/k, 2/k, \ldots, 1\}$, where $k\in\mathbb{N}$. The player who sets the lowest price captures the entire demand; if multiple players tie for the lowest price, they split the demand equally. We study the Bertrand paradox, where classical theory predicts low prices, yet real markets often sustain high prices. To understand this gap, we analyze a repeated-game model in which firms set prices using no-regret learners. Our goal is to characterize the equilibrium outcomes that can arise under different no-regret learning guarantees. We are particularly interested in questions such as whether no-external-regret learners can converge to undesirable high-price outcomes, and how stronger guarantees such as no-swap regret shape the emergence of competitive low-price behavior. We address these and related questions through a theoretical analysis, complemented by experiments that support the theory and reveal surprising phenomena for no-swap regret learners.
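
The one-shot payoff rule described above can be sketched as follows. This is an illustration under stated assumptions: each firm's payoff is taken to be price times its share of demand (zero production cost), which the abstract does not spell out, and the names `bertrand_payoffs` and `unit_demand` are hypothetical.

```python
from fractions import Fraction


def bertrand_payoffs(prices, demand):
    """Payoffs for one round of the discrete Bertrand game.

    The lowest-priced firms split demand(lowest price) equally; every other
    firm sells nothing. Assumption: payoff = price * quantity sold.
    """
    lowest = min(prices)
    winners = {i for i, p in enumerate(prices) if p == lowest}
    share = demand(lowest) / len(winners)
    return [lowest * share if i in winners else Fraction(0) for i in range(len(prices))]


def unit_demand(price):
    """A hypothetical non-increasing demand function: constant unit demand."""
    return Fraction(1)


# Price grid {1/k, ..., 1} with k = 4; exact fractions avoid float ties.
k = 4
prices = [Fraction(2, k), Fraction(2, k), Fraction(4, k)]
payoffs = bertrand_payoffs(prices, unit_demand)
```

With these inputs the two firms tied at price $2/k$ each receive half of the demand and the third firm receives nothing, which is the undercutting incentive behind the classical low-price prediction.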

LG · Apr 1, 2025
Efficient Near-Optimal Algorithm for Online Shortest Paths in Directed Acyclic Graphs with Bandit Feedback Against Adaptive Adversaries

Arnab Maiti, Zhiyuan Fan, Kevin Jamieson et al.

In this paper, we study the online shortest path problem in directed acyclic graphs (DAGs) under bandit feedback against an adaptive adversary. Given a DAG $G = (V, E)$ with a source node $v_{\mathsf{s}}$ and a sink node $v_{\mathsf{t}}$, let $X \subseteq \{0,1\}^{|E|}$ denote the set of all paths from $v_{\mathsf{s}}$ to $v_{\mathsf{t}}$. At each round $t$, we select a path $\mathbf{x}_t \in X$ and receive bandit feedback on our loss $\langle \mathbf{x}_t, \mathbf{y}_t \rangle \in [-1,1]$, where $\mathbf{y}_t$ is an adversarially chosen loss vector. Our goal is to minimize regret with respect to the best path in hindsight over $T$ rounds. We propose the first computationally efficient algorithm to achieve a near-minimax optimal regret bound of $\tilde O(\sqrt{|E|T\log |X|})$ with high probability against any adaptive adversary, where $\tilde O(\cdot)$ hides logarithmic factors in the number of edges $|E|$. Our algorithm leverages a novel loss estimator and a centroid-based decomposition in a nontrivial manner to attain this regret bound. As an application, we show that our algorithm for DAGs provides state-of-the-art efficient algorithms for $m$-sets, extensive-form games, the Colonel Blotto game, shortest walks in directed graphs, hypercubes, and multi-task multi-armed bandits, achieving improved high-probability regret guarantees in all these settings.

LG · Oct 20, 2025
Adapting to Stochastic and Adversarial Losses in Episodic MDPs with Aggregate Bandit Feedback

Shinji Ito, Kevin Jamieson, Haipeng Luo et al.

We study online learning in finite-horizon episodic Markov decision processes (MDPs) under the challenging aggregate bandit feedback model, where the learner observes only the cumulative loss incurred in each episode, rather than individual losses at each state-action pair. While prior work in this setting has focused exclusively on worst-case analysis, we initiate the study of best-of-both-worlds (BOBW) algorithms that achieve low regret in both stochastic and adversarial environments. We propose the first BOBW algorithms for episodic tabular MDPs with aggregate bandit feedback. In the case of known transitions, our algorithms achieve $O(\log T)$ regret in stochastic settings and ${O}(\sqrt{T})$ regret in adversarial ones. Importantly, we also establish matching lower bounds, showing the optimality of our algorithms in this setting. We further extend our approach to unknown-transition settings by incorporating confidence-based techniques. Our results rely on a combination of FTRL over occupancy measures, self-bounding techniques, and new loss estimators inspired by recent advances in online shortest path problems. Along the way, we also provide the first individual-gap-dependent lower bounds and demonstrate near-optimal BOBW algorithms for shortest path problems with bandit feedback.

GT · May 29, 2025
Learning to Incentivize in Repeated Principal-Agent Problems with Adversarial Agent Arrivals

Junyan Liu, Arnab Maiti, Artin Tajdini et al.

We initiate the study of a repeated principal-agent problem over a finite horizon $T$, where a principal sequentially interacts with $K\geq 2$ types of agents arriving in an adversarial order. At each round, the principal strategically chooses one of the $N$ arms to incentivize for an arriving agent of unknown type. The agent then chooses an arm based on its own utility and the provided incentive, and the principal receives a corresponding reward. The objective is to minimize regret against the best incentive in hindsight. Without prior knowledge of agent behavior, we show that the problem becomes intractable, leading to linear regret. We analyze two key settings where sublinear regret is achievable. In the first setting, the principal knows the arm each agent type would select greedily for any given incentive. Under this setting, we propose an algorithm that achieves a regret bound of $O(\min\{\sqrt{KT\log N},K\sqrt{T}\})$ and provide a matching lower bound up to a $\log K$ factor. In the second setting, an agent's response varies smoothly with the incentive and is governed by a Lipschitz constant $L\geq 1$. Under this setting, we show that there is an algorithm with a regret bound of $\tilde{O}((LN)^{1/3}T^{2/3})$ and establish a matching lower bound up to logarithmic factors. Finally, we extend our algorithmic results for both settings by allowing the principal to incentivize multiple arms simultaneously in each round.

LG · Oct 20, 2025
On the Universal Near Optimality of Hedge in Combinatorial Settings

Zhiyuan Fan, Arnab Maiti, Kevin Jamieson et al.

In this paper, we study the classical Hedge algorithm in combinatorial settings. In each round, the learner selects a vector $\boldsymbol{x}_t$ from a set $X \subseteq \{0,1\}^d$, observes a full loss vector $\boldsymbol{y}_t \in \mathbb{R}^d$, and incurs a loss $\langle \boldsymbol{x}_t, \boldsymbol{y}_t \rangle \in [-1,1]$. This setting captures several important problems, including extensive-form games, resource allocation, $m$-sets, online multitask learning, and shortest-path problems on directed acyclic graphs (DAGs). It is well known that Hedge achieves a regret of $O\big(\sqrt{T \log |X|}\big)$ after $T$ rounds of interaction. In this paper, we ask whether Hedge is optimal across all combinatorial settings. To that end, we show that for any $X \subseteq \{0,1\}^d$, Hedge is near-optimal--specifically, up to a $\sqrt{\log d}$ factor--by establishing a lower bound of $Ω\big(\sqrt{T \log(|X|)/\log d}\big)$ that holds for any algorithm. We then identify a natural class of combinatorial sets--namely, $m$-sets with $\log d \leq m \leq \sqrt{d}$--for which this lower bound is tight, and for which Hedge is provably suboptimal by a factor of exactly $\sqrt{\log d}$. At the same time, we show that Hedge is optimal for online multitask learning, a generalization of the classical $K$-experts problem. Finally, we leverage the near-optimality of Hedge to establish the existence of a near-optimal regularizer for online shortest-path problems in DAGs--a setting that subsumes a broad range of combinatorial domains. Specifically, we show that the classical Online Mirror Descent (OMD) algorithm, when instantiated with the dilated entropy regularizer, is iterate-equivalent to Hedge, and therefore inherits its near-optimal regret guarantees for DAGs.
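
For readers unfamiliar with Hedge in this setting, the following is a minimal sketch of its sampling distribution over an explicitly enumerated set $X$, using $m$-sets as the example. Explicit enumeration is only feasible for small $d$, the learning rate `eta` is left as a free parameter, and the function names are illustrative assumptions rather than anything from the paper.

```python
import itertools
import math


def m_sets(d, m):
    """Enumerate all 0/1 vectors in {0,1}^d with exactly m ones."""
    return [
        tuple(1 if i in chosen else 0 for i in range(d))
        for chosen in itertools.combinations(range(d), m)
    ]


def hedge_distribution(X, cumulative_loss, eta):
    """Hedge weights over X: probability proportional to exp(-eta * <x, L>).

    cumulative_loss is the coordinate-wise sum of the loss vectors y_t seen
    so far. Scores are shifted by their maximum before exponentiating to
    avoid overflow; the shift cancels in the normalization.
    """
    scores = [-eta * sum(xi * li for xi, li in zip(x, cumulative_loss)) for x in X]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


X = m_sets(d=3, m=2)  # (1,1,0), (1,0,1), (0,1,1)
probs = hedge_distribution(X, cumulative_loss=[1.0, 0.0, 0.0], eta=1.0)
```

Here the set avoiding the first coordinate has accumulated the least loss, so it receives the largest probability. In each round the learner would sample $\boldsymbol{x}_t$ from this distribution and then add $\boldsymbol{y}_t$ to the cumulative loss.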

LG · Dec 9, 2020
Streaming Algorithms for Stochastic Multi-armed Bandits

Arnab Maiti, Vishakha Patil, Arindam Khan

We study the Stochastic Multi-armed Bandit problem under bounded arm-memory. In this setting, the arms arrive in a stream, and the number of arms that can be stored in the memory at any time is bounded. The decision-maker can only pull arms that are present in the memory. We address the problem from the perspective of two standard objectives: 1) regret minimization, and 2) best-arm identification. For regret minimization, we settle an important open question by showing an almost tight hardness. We show $Ω(T^{2/3})$ cumulative regret in expectation for an arm-memory size of $n-1$, where $n$ is the number of arms. For best-arm identification, we study two algorithms. First, we present an $O(r)$ arm-memory $r$-round adaptive streaming algorithm to find an $ε$-best arm. In an $r$-round adaptive streaming algorithm for best-arm identification, the arm pulls in each round are decided based on the observed outcomes in the earlier rounds. The best arm is the output at the end of $r$ rounds. The upper bound on the sample complexity of our algorithm matches the lower bound for any $r$-round adaptive streaming algorithm. Secondly, we present a heuristic to find the $ε$-best arm with optimal sample complexity, by storing only one extra arm in the memory.

DS · Nov 18, 2016
Query Complexity of Tournament Solutions

Arnab Maiti, Palash Dey

A directed graph where there is exactly one edge between every pair of vertices is called a tournament. Finding the "best" set of vertices of a tournament is a well-studied problem in social choice theory. A tournament solution takes a tournament as input and outputs a subset of vertices of the input tournament. However, in many applications, for example, choosing the best set of drugs from a given set of drugs, the edges of the tournament are given only implicitly and knowing the orientation of an edge is costly. In such scenarios, we would like to know the best set of vertices (according to some tournament solution) by "querying" as few edges as possible. In this paper, we study precisely this problem for commonly used tournament solutions: given oracle access to the edges of a tournament $T$, find $f(T)$ by querying as few edges as possible, for a tournament solution $f$. We first show that the set of Condorcet non-losers in a tournament can be found by querying $2n-\lfloor \log n \rfloor -2$ edges only and this is tight in the sense that every algorithm for finding the set of Condorcet non-losers needs to query at least $2n-\lfloor \log n \rfloor -2$ edges in the worst case, where $n$ is the number of vertices in the input tournament. We then move on to study other popular tournament solutions and show that any algorithm for finding the Copeland set, the Slater set, the Markov set, the bipartisan set, the uncovered set, the Banks set, and the top cycle must query $Ω(n^2)$ edges in the worst case. On the positive side, we are able to circumvent our strong query complexity lower bound results by proving that, if the size of the top cycle of the input tournament is at most $k$, then we can find all the tournament solutions mentioned above by querying $O(nk + \frac{n\log n}{\log(1-\frac{1}{k})})$ edges only.
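
The query model can be illustrated with the trivial baseline that the $Ω(n^2)$ lower bound says is unavoidable in the worst case: query every edge once and read off the Copeland set (the vertices of maximum out-degree). This is a sketch under assumptions, not an algorithm from the paper; `copeland_set` and the oracle `beats` are hypothetical names.

```python
def copeland_set(n, beats):
    """Compute the Copeland set of an n-vertex tournament via an edge oracle.

    beats(u, v) returns True if the edge between u and v is oriented u -> v.
    Every unordered pair is queried exactly once, so the query count is
    n * (n - 1) / 2. Returns (copeland_set, number_of_queries).
    """
    queries = 0
    wins = [0] * n
    for u in range(n):
        for v in range(u + 1, n):
            queries += 1
            if beats(u, v):
                wins[u] += 1
            else:
                wins[v] += 1
    best = max(wins)
    return [v for v in range(n) if wins[v] == best], queries


# Hypothetical 3-cycle: 0 beats 1, 1 beats 2, 2 beats 0.
cycle_edges = {(0, 1), (1, 2), (2, 0)}
winners, used = copeland_set(3, lambda u, v: (u, v) in cycle_edges)
```

In the 3-cycle every vertex wins exactly once, so all three vertices are returned after three queries.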