Taira Tsuchiya

LG
Semantic Scholar Profile
h-index48
24papers
173citations
Novelty58%
AI Score55

24 Papers

LGJun 2, 2022
Nearly Optimal Best-of-Both-Worlds Algorithms for Online Learning with Feedback Graphs

Shinji Ito, Taira Tsuchiya, Junya Honda

This study considers online learning with general directed feedback graphs. For this problem, we present best-of-both-worlds algorithms that achieve nearly tight regret bounds for adversarial environments as well as poly-logarithmic regret bounds for stochastic environments. As Alon et al. [2015] have shown, tight regret bounds depend on the structure of the feedback graph: strongly observable graphs yield minimax regret of $\tildeΘ( α^{1/2} T^{1/2} )$, while weakly observable graphs induce minimax regret of $\tildeΘ( δ^{1/3} T^{2/3} )$, where $α$ and $δ$, respectively, represent the independence number of the graph and the domination number of a certain portion of the graph. Our proposed algorithm for strongly observable graphs has a regret bound of $\tilde{O}( α^{1/2} T^{1/2} ) $ for adversarial environments, as well as of $ {O} ( \frac{α(\ln T)^3 }{Δ_{\min}} ) $ for stochastic environments, where $Δ_{\min}$ expresses the minimum suboptimality gap. This result resolves an open question raised by Erez and Koren [2021]. We also provide an algorithm for weakly observable graphs that achieves a regret bound of $\tilde{O}( δ^{1/3}T^{2/3} )$ for adversarial environments and poly-logarithmic regret for stochastic environments. The proposed algorithms are based on the follow-the-regularized-leader approach combined with newly designed update rules for learning rates.

MLJun 9, 2022
Minimax Optimal Algorithms for Fixed-Budget Best Arm Identification

Junpei Komiyama, Taira Tsuchiya, Junya Honda

We consider the fixed-budget best arm identification problem where the goal is to find the arm of the largest mean with a fixed number of samples. It is known that the probability of misidentifying the best arm is exponentially small to the number of rounds. However, limited characterizations have been discussed on the rate (exponent) of this value. In this paper, we characterize the minimax optimal rate as a result of an optimization over all possible parameters. We introduce two rates, $R^{\mathrm{go}}$ and $R^{\mathrm{go}}_{\infty}$, corresponding to lower bounds on the probability of misidentification, each of which is associated with a proposed algorithm. The rate $R^{\mathrm{go}}$ is associated with $R^{\mathrm{go}}$-tracking, which can be efficiently implemented by a neural network and is shown to outperform existing algorithms. However, this rate requires a nontrivial condition to be achievable. To address this issue, we introduce the second rate $R^{\mathrm{go}}_\infty$. We show that this rate is indeed achievable by introducing a conceptual algorithm called delayed optimal tracking (DOT).

LGJun 14, 2022
Adversarially Robust Multi-Armed Bandit Algorithm with Variance-Dependent Regret Bounds

Shinji Ito, Taira Tsuchiya, Junya Honda

This paper considers the multi-armed bandit (MAB) problem and provides a new best-of-both-worlds (BOBW) algorithm that works nearly optimally in both stochastic and adversarial settings. In stochastic settings, some existing BOBW algorithms achieve tight gap-dependent regret bounds of $O(\sum_{i: Δ_i>0} \frac{\log T}{Δ_i})$ for suboptimality gap $Δ_i$ of arm $i$ and time horizon $T$. As Audibert et al. [2007] have shown, however, that the performance can be improved in stochastic environments with low-variance arms. In fact, they have provided a stochastic MAB algorithm with gap-variance-dependent regret bounds of $O(\sum_{i: Δ_i>0} (\frac{σ_i^2}{Δ_i} + 1) \log T )$ for loss variance $σ_i^2$ of arm $i$. In this paper, we propose the first BOBW algorithm with gap-variance-dependent bounds, showing that the variance information can be used even in the possibly adversarial environment. Further, the leading constant factor in our gap-variance dependent bound is only (almost) twice the value for the lower bound. Additionally, the proposed algorithm enjoys multiple data-dependent regret bounds in adversarial settings and works well in stochastic settings with adversarial corruptions. The proposed algorithm is based on the follow-the-regularized-leader method and employs adaptive learning rates that depend on the empirical prediction error of the loss, which leads to gap-variance-dependent regret bounds reflecting the variance of the arms.

LGJul 29, 2022
Best-of-Both-Worlds Algorithms for Partial Monitoring

Taira Tsuchiya, Shinji Ito, Junya Honda

This study considers the partial monitoring problem with $k$-actions and $d$-outcomes and provides the first best-of-both-worlds algorithms, whose regrets are favorably bounded both in the stochastic and adversarial regimes. In particular, we show that for non-degenerate locally observable games, the regret is $O(m^2 k^4 \log(T) \log(k_Π T) / Δ_{\min})$ in the stochastic regime and $O(m k^{2/3} \sqrt{T \log(T) \log k_Π})$ in the adversarial regime, where $T$ is the number of rounds, $m$ is the maximum number of distinct observations per action, $Δ_{\min}$ is the minimum suboptimality gap, and $k_Π$ is the number of Pareto optimal actions. Moreover, we show that for globally observable games, the regret is $O(c_{\mathcal{G}}^2 \log(T) \log(k_Π T) / Δ_{\min}^2)$ in the stochastic regime and $O((c_{\mathcal{G}}^2 \log(T) \log(k_Π T))^{1/3} T^{2/3})$ in the adversarial regime, where $c_{\mathcal{G}}$ is a game-dependent constant. We also provide regret bounds for a stochastic regime with adversarial corruptions. Our algorithms are based on the follow-the-regularized-leader framework and are inspired by the approach of exploration by optimization and the adaptive learning rate in the field of online learning with feedback graphs.

LGFeb 6
Adversarial Learning in Games with Bandit Feedback: Logarithmic Pure-Strategy Maximin Regret

Shinji Ito, Haipeng Luo, Arnab Maiti et al.

Learning to play zero-sum games is a fundamental problem in game theory and machine learning. While significant progress has been made in minimizing external regret in the self-play settings or with full-information feedback, real-world applications often force learners to play against unknown, arbitrary opponents and restrict learners to bandit feedback where only the payoff of the realized action is observable. In such challenging settings, it is well-known that $Ω(\sqrt{T})$ external regret is unavoidable (where T is the number of rounds). To overcome this barrier, we investigate adversarial learning in zero-sum games under bandit feedback, aiming to minimize the deficit against the maximin pure strategy -- a metric we term Pure-Strategy Maximin Regret. We analyze this problem under two bandit feedback models: uninformed (only the realized reward is revealed) and informed (both the reward and the opponent's action are revealed). For uninformed bandit learning of normal-form games, we show that the Tsallis-INF algorithm achieves $O(c \log T)$ instance-dependent regret with a game-dependent parameter $c$. Crucially, we prove an information-theoretic lower bound showing that the dependence on c is necessary. To overcome this hardness, we turn to the informed setting and introduce Maximin-UCB, which obtains another regret bound of the form $O(c' \log T)$ for a different game-dependent parameter $c'$ that could potentially be much smaller than $c$. Finally, we generalize both results to bilinear games over an arbitrary, large action set, proposing Tsallis-FTRL-SPM and Maximin-LinUCB for the uninformed and informed setting respectively and establishing similar game-dependent logarithmic regret bounds.

GTFeb 12
Scale-Invariant Fast Convergence in Games

Taira Tsuchiya, Haipeng Luo, Shinji Ito

Scale-invariance in games has recently emerged as a widely valued desirable property. Yet, almost all fast convergence guarantees in learning in games require prior knowledge of the utility scale. To address this, we develop learning dynamics that achieve fast convergence while being both scale-free, requiring no prior information about utilities, and scale-invariant, remaining unchanged under positive rescaling of utilities. For two-player zero-sum games, we obtain scale-free and scale-invariant dynamics with external regret bounded by $\tilde{O}(A_{\mathrm{diff}})$, where $A_{\mathrm{diff}}$ is the payoff range, which implies an $\tilde{O}(A_{\mathrm{diff}} / T)$ convergence rate to Nash equilibrium after $T$ rounds. For multiplayer general-sum games with $n$ players and $m$ actions, we obtain scale-free and scale-invariant dynamics with swap regret bounded by $O(U_{\mathrm{max}} \log T)$, where $U_{\mathrm{max}}$ is the range of the utilities, ignoring the dependence on the number of players and actions. This yields an $O(U_{\mathrm{max}} \log T / T)$ convergence rate to correlated equilibrium. Our learning dynamics are based on optimistic follow-the-regularized-leader with an adaptive learning rate that incorporates the squared path length of the opponents' gradient vectors, together with a new stopping-time analysis that exploits negative terms in regret bounds without scale-dependent tuning. For general-sum games, scale-free learning is enabled also by a technique called doubling clipping, which clips observed gradients based on past observations.

LGFeb 2
Data- and Variance-dependent Regret Bounds for Online Tabular MDPs

Mingyi Li, Taira Tsuchiya, Kenji Yamanishi

This work studies online episodic tabular Markov decision processes (MDPs) with known transitions and develops best-of-both-worlds algorithms that achieve refined data-dependent regret bounds in the adversarial regime and variance-dependent regret bounds in the stochastic regime. We quantify MDP complexity using a first-order quantity and several new data-dependent measures for the adversarial regime, including a second-order quantity and a path-length measure, as well as variance-based measures for the stochastic regime. To adapt to these measures, we develop algorithms based on global optimization and policy optimization, both built on optimistic follow-the-regularized-leader with log-barrier regularization. For global optimization, our algorithms achieve first-order, second-order, and path-length regret bounds in the adversarial regime, and in the stochastic regime, they achieve a variance-aware gap-independent bound and a variance-aware gap-dependent bound that is polylogarithmic in the number of episodes. For policy optimization, our algorithms achieve the same data- and variance-dependent adaptivity, up to a factor of the episode horizon, by exploiting a new optimistic $Q$-function estimator. Finally, we establish regret lower bounds in terms of data-dependent complexity measures for the adversarial regime and a variance measure for the stochastic regime, implying that the regret upper bounds achieved by the global-optimization approach are nearly optimal.

LGDec 24, 2023
Best-of-Both-Worlds Algorithms for Linear Contextual Bandits

Yuko Kuroki, Alberto Rumi, Taira Tsuchiya et al.

We study best-of-both-worlds algorithms for $K$-armed linear contextual bandits. Our algorithms deliver near-optimal regret bounds in both the adversarial and stochastic regimes, without prior knowledge about the environment. In the stochastic regime, we achieve the polylogarithmic rate $\frac{(dK)^2\mathrm{poly}\log(dKT)}{Δ_{\min}}$, where $Δ_{\min}$ is the minimum suboptimality gap over the $d$-dimensional context space. In the adversarial regime, we obtain either the first-order $\widetilde{O}(dK\sqrt{L^*})$ bound, or the second-order $\widetilde{O}(dK\sqrt{Λ^*})$ bound, where $L^*$ is the cumulative loss of the best action and $Λ^*$ is a notion of the cumulative second moment for the losses incurred by the algorithm. Moreover, we develop an algorithm based on FTRL with Shannon entropy regularizer that does not require the knowledge of the inverse of the covariance matrix, and achieves a polylogarithmic regret in the stochastic regime while obtaining $\widetilde{O}\big(dK\sqrt{T}\big)$ regret bounds in the adversarial regime.

LGMar 1, 2024
Adaptive Learning Rate for Follow-the-Regularized-Leader: Competitive Analysis and Best-of-Both-Worlds

Shinji Ito, Taira Tsuchiya, Junya Honda

Follow-The-Regularized-Leader (FTRL) is known as an effective and versatile approach in online learning, where appropriate choice of the learning rate is crucial for smaller regret. To this end, we formulate the problem of adjusting FTRL's learning rate as a sequential decision-making problem and introduce the framework of competitive analysis. We establish a lower bound for the competitive ratio and propose update rules for learning rate that achieves an upper bound within a constant factor of this lower bound. Specifically, we illustrate that the optimal competitive ratio is characterized by the (approximate) monotonicity of components of the penalty term, showing that a constant competitive ratio is achievable if the components of the penalty term form a monotonically non-increasing sequence, and derive a tight competitive ratio when penalty terms are $ξ$-approximately monotone non-increasing. Our proposed update rule, referred to as \textit{stability-penalty matching}, also facilitates constructing the Best-Of-Both-Worlds (BOBW) algorithms for stochastic and adversarial environments. In these environments our result contributes to achieve tighter regret bound and broaden the applicability of algorithms for various settings such as multi-armed bandits, graph bandits, linear bandits, and contextual bandits.

LGFeb 13, 2024
Online Structured Prediction with Fenchel--Young Losses and Improved Surrogate Regret for Online Multiclass Classification with Logistic Loss

Shinsaku Sakaue, Han Bao, Taira Tsuchiya et al.

This paper studies online structured prediction with full-information feedback. For online multiclass classification, Van der Hoeven (2020) established \emph{finite} surrogate regret bounds, which are independent of the time horizon, by introducing an elegant \emph{exploit-the-surrogate-gap} framework. However, this framework has been limited to multiclass classification primarily because it relies on a classification-specific procedure for converting estimated scores to outputs. We extend the exploit-the-surrogate-gap framework to online structured prediction with \emph{Fenchel--Young losses}, a large family of surrogate losses that includes the logistic loss for multiclass classification as a special case, obtaining finite surrogate regret bounds in various structured prediction problems. To this end, we propose and analyze \emph{randomized decoding}, which converts estimated scores to general structured outputs. Moreover, by applying our decoding to online multiclass classification with the logistic loss, we obtain a surrogate regret bound of $O(\| \mathbf{U} \|_\mathrm{F}^2)$, where $\mathbf{U}$ is the best offline linear estimator and $\| \cdot \|_\mathrm{F}$ denotes the Frobenius norm. This bound is tight up to logarithmic factors and improves the previous bound of $O(d\| \mathbf{U} \|_\mathrm{F}^2)$ due to Van der Hoeven (2020) by a factor of $d$, the number of classes.

LGFeb 24, 2025
Instance-Dependent Regret Bounds for Learning Two-Player Zero-Sum Games with Bandit Feedback

Shinji Ito, Haipeng Luo, Taira Tsuchiya et al.

No-regret self-play learning dynamics have become one of the premier ways to solve large-scale games in practice. Accelerating their convergence via improving the regret of the players over the naive $O(\sqrt{T})$ bound after $T$ rounds has been extensively studied in recent years, but almost all studies assume access to exact gradient feedback. We address the question of whether acceleration is possible under bandit feedback only and provide an affirmative answer for two-player zero-sum normal-form games. Specifically, we show that if both players apply the Tsallis-INF algorithm of Zimmert and Seldin (2018, arXiv:1807.07623), then their regret is at most $O(c_1 \log T + \sqrt{c_2 T})$, where $c_1$ and $c_2$ are game-dependent constants that characterize the difficulty of learning -- $c_1$ resembles the complexity of learning a stochastic multi-armed bandit instance and depends inversely on some gap measures, while $c_2$ can be much smaller than the number of actions when the Nash equilibria have a small support or are close to the boundary. In particular, for the case when a pure strategy Nash equilibrium exists, $c_2$ becomes zero, leading to an optimal instance-dependent regret bound as we show. We additionally prove that in this case, our algorithm also enjoys last-iterate convergence and can identify the pure strategy Nash equilibrium with near-optimal sample complexity.

LGJan 23, 2025
Revisiting Online Learning Approach to Inverse Linear Optimization: A Fenchel$-$Young Loss Perspective and Gap-Dependent Regret Analysis

Shinsaku Sakaue, Han Bao, Taira Tsuchiya

This paper revisits the online learning approach to inverse linear optimization studied by Bärmann et al. (2017), where the goal is to infer an unknown linear objective function of an agent from sequential observations of the agent's input-output pairs. First, we provide a simple understanding of the online learning approach through its connection to online convex optimization of \emph{Fenchel--Young losses}. As a byproduct, we present an offline guarantee on the \emph{suboptimality loss}, which measures how well predicted objectives explain the agent's choices, without assuming the optimality of the agent's choices. Second, assuming that there is a gap between optimal and suboptimal objective values in the agent's decision problems, we obtain an upper bound independent of the time horizon $T$ on the sum of suboptimality and \emph{estimate losses}, where the latter measures the quality of solutions recommended by predicted objectives. Interestingly, our gap-dependent analysis achieves a faster rate than the standard $O(\sqrt{T})$ regret bound by exploiting structures specific to inverse linear optimization, even though neither the loss functions nor their domains enjoy desirable properties, such as strong convexity.

LGFeb 13, 2024
Exploration by Optimization with Hybrid Regularizers: Logarithmic Regret with Adversarial Robustness in Partial Monitoring

Taira Tsuchiya, Shinji Ito, Junya Honda

Partial monitoring is a generic framework of online decision-making problems with limited feedback. To make decisions from such limited feedback, it is necessary to find an appropriate distribution for exploration. Recently, a powerful approach for this purpose, \emph{exploration by optimization} (ExO), was proposed, which achieves optimal bounds in adversarial environments with follow-the-regularized-leader for a wide range of online decision-making problems. However, a naive application of ExO in stochastic environments significantly degrades regret bounds. To resolve this issue in locally observable games, we first establish a new framework and analysis for ExO with a hybrid regularizer. This development allows us to significantly improve existing regret bounds of best-of-both-worlds (BOBW) algorithms, which achieves nearly optimal bounds both in stochastic and adversarial environments. In particular, we derive a stochastic regret bound of $O(\sum_{a \neq a^*} k^2 m^2 \log T / Δ_a)$, where $k$, $m$, and $T$ are the numbers of actions, observations and rounds, $a^*$ is an optimal action, and $Δ_a$ is the suboptimality gap for action $a$. This bound is roughly $Θ(k^2 \log T)$ times smaller than existing BOBW bounds. In addition, for globally observable games, we provide a new BOBW algorithm with the first $O(\log T)$ stochastic bound.

GTDec 10, 2024
Corrupted Learning Dynamics in Games

Taira Tsuchiya, Shinji Ito, Haipeng Luo

Learning in games refers to scenarios where multiple players interact in a shared environment, each aiming to minimize their regret. An equilibrium can be computed at a fast rate of $O(1/T)$ when all players follow the optimistic follow-the-regularized-leader (OFTRL). However, this acceleration is limited to the honest regime, in which all players adhere to a prescribed algorithm -- a situation that may not be realistic in practice. To address this issue, we present corrupted learning dynamics that adaptively find an equilibrium at a rate that depends on the extent to which each player deviates from the strategy suggested by the prescribed algorithm. First, in two-player zero-sum corrupted games, we provide learning dynamics for which the external regret of $x$-player (and similarly for $y$-player) is roughly bounded by $O(\log (m_x m_y) + \sqrt{\hat{C}_y} + \hat{C}_x)$, where $m_x$ and $m_y$ denote the number of actions of $x$- and $y$-players, respectively, and $\hat{C}_x$ and $\hat{C}_y$ represent their cumulative deviations. We then extend our approach to multi-player general-sum corrupted games, providing learning dynamics for which the swap regret of player $i$ is bounded by $O(\log T + \sqrt{\sum_{k} \hat{C}_k \log T} + \hat{C}_i)$ ignoring dependence on the number of players and actions, where $\hat{C}_i$ is the cumulative deviation of player $i$ from the prescribed algorithm. Our learning dynamics are agnostic to the levels of corruption. A key technical contribution is a new analysis that ensures the stability of a Markov chain under a new adaptive learning rate, thereby allowing us to achieve the desired bound in the corrupted regime while matching the best existing bound in the honest regime. Notably, our framework can be extended to address not only corruption in strategies but also corruption in the observed expected utilities, and we provide several matching lower bounds.

LGOct 20, 2025
Adapting to Stochastic and Adversarial Losses in Episodic MDPs with Aggregate Bandit Feedback

Shinji Ito, Kevin Jamieson, Haipeng Luo et al.

We study online learning in finite-horizon episodic Markov decision processes (MDPs) under the challenging aggregate bandit feedback model, where the learner observes only the cumulative loss incurred in each episode, rather than individual losses at each state-action pair. While prior work in this setting has focused exclusively on worst-case analysis, we initiate the study of best-of-both-worlds (BOBW) algorithms that achieve low regret in both stochastic and adversarial environments. We propose the first BOBW algorithms for episodic tabular MDPs with aggregate bandit feedback. In the case of known transitions, our algorithms achieve $O(\log T)$ regret in stochastic settings and ${O}(\sqrt{T})$ regret in adversarial ones. Importantly, we also establish matching lower bounds, showing the optimality of our algorithms in this setting. We further extend our approach to unknown-transition settings by incorporating confidence-based techniques. Our results rely on a combination of FTRL over occupancy measures, self-bounding techniques, and new loss estimators inspired by recent advances in online shortest path problems. Along the way, we also provide the first individual-gap-dependent lower bounds and demonstrate near-optimal BOBW algorithms for shortest path problems with bandit feedback.

LGFeb 26, 2025
Bandit and Delayed Feedback in Online Structured Prediction

Yuki Shibukawa, Taira Tsuchiya, Shinsaku Sakaue et al.

Online structured prediction is a task of sequentially predicting outputs with complex structures based on inputs and past observations, encompassing online classification. Recent studies showed that in the full-information setting, we can achieve finite bounds on the \textit{surrogate regret}, i.e. the extra target loss relative to the best possible surrogate loss. In practice, however, full-information feedback is often unrealistic as it requires immediate access to the whole structure of complex outputs. Motivated by this, we propose algorithms that work with less demanding feedback, bandit and delayed feedback. For bandit feedback, by using a standard inverse-weighted gradient estimator, we achieve a surrogate regret bound of $O(\sqrt{KT})$ for the time horizon $T$ and the size of the output set $K$. However, $K$ can be extremely large when outputs are highly complex, resulting in an undesirable bound. To address this issue, we propose another algorithm that achieves a surrogate regret bound of $O(T^{2/3})$, which is independent of $K$. This is achieved with a carefully designed pseudo-inverse matrix estimator. Furthermore, we numerically compare the performance of these algorithms, as well as existing ones. Regarding delayed feedback, we provide algorithms and regret analyses that cover various scenarios, including full-information and bandit feedback, as well as fixed and variable delays.

LGFeb 20, 2024
Fast Rates in Stochastic Online Convex Optimization by Exploiting the Curvature of Feasible Sets

Taira Tsuchiya, Shinji Ito

In this work, we explore online convex optimization (OCO) and introduce a new condition and analysis that provides fast rates by exploiting the curvature of feasible sets. In online linear optimization, it is known that if the average gradient of loss functions exceeds a certain threshold, the curvature of feasible sets can be exploited by the follow-the-leader (FTL) algorithm to achieve a logarithmic regret. This study reveals that algorithms adaptive to the curvature of loss functions can also leverage the curvature of feasible sets. In particular, we first prove that if an optimal decision is on the boundary of a feasible set and the gradient of an underlying loss function is non-zero, then the algorithm achieves a regret bound of $O(ρ\log T)$ in stochastic environments. Here, $ρ> 0$ is the radius of the smallest sphere that includes the optimal decision and encloses the feasible set. Our approach, unlike existing ones, can work directly with convex loss functions, exploiting the curvature of loss functions simultaneously, and can achieve the logarithmic regret only with a local property of feasible sets. Additionally, the algorithm achieves an $O(\sqrt{T})$ regret even in adversarial environments, in which FTL suffers an $Ω(T)$ regret, and achieves an $O(ρ\log T + \sqrt{C ρ\log T})$ regret in corrupted stochastic environments with corruption level $C$. Furthermore, by extending our analysis, we establish a matching regret upper bound of $O\Big(T^{\frac{q-2}{2(q-1)}} (\log T)^{\frac{q}{2(q-1)}}\Big)$ for $q$-uniformly convex feasible sets, where uniformly convex sets include strongly convex sets and $\ell_p$-balls for $p \in [2,\infty)$. This bound bridges the gap between the $O(\log T)$ bound for strongly convex sets~($q=2$) and the $O(\sqrt{T})$ bound for non-curved sets~($q\to\infty$).

LGOct 13, 2025
Tight Regret Upper and Lower Bounds for Optimistic Hedge in Two-Player Zero-Sum Games

Taira Tsuchiya

In two-player zero-sum games, the learning dynamic based on optimistic Hedge achieves one of the best-known regret upper bounds among strongly-uncoupled learning dynamics. With an appropriately chosen learning rate, the social and individual regrets can be bounded by $O(\log(mn))$ in terms of the numbers of actions $m$ and $n$ of the two players. This study investigates the optimality of the dependence on $m$ and $n$ in the regret of optimistic Hedge. To this end, we begin by refining existing regret analysis and show that, in the strongly-uncoupled setting where the opponent's number of actions is known, both the social and individual regret bounds can be improved to $O(\sqrt{\log m \log n})$. In this analysis, we express the regret upper bound as an optimization problem with respect to the learning rates and the coefficients of certain negative terms, enabling refined analysis of the leading constants. We then show that the existing social regret bound as well as these new social and individual regret upper bounds cannot be further improved for optimistic Hedge by providing algorithm-dependent individual regret lower bounds. Importantly, these social regret upper and lower bounds match exactly including the constant factor in the leading term. Finally, building on these results, we improve the last-iterate convergence rate and the dynamic regret of a learning dynamic based on optimistic Hedge, and complement these bounds with algorithm-dependent dynamic regret lower bounds that match the improved bounds.

LGJul 15, 2025
Reinforcement Learning from Adversarial Preferences in Tabular MDPs

Taira Tsuchiya, Shinji Ito, Haipeng Luo

We introduce a new framework of episodic tabular Markov decision processes (MDPs) with adversarial preferences, which we refer to as preference-based MDPs (PbMDPs). Unlike standard episodic MDPs with adversarial losses, where the numerical value of the loss is directly observed, in PbMDPs the learner instead observes preferences between two candidate arms, which represent the choices being compared. In this work, we focus specifically on the setting where the reward functions are determined by Borda scores. We begin by establishing a regret lower bound for PbMDPs with Borda scores. As a preliminary step, we present a simple instance to prove a lower bound of $Ω(\sqrt{HSAT})$ for episodic MDPs with adversarial losses, where $H$ is the number of steps per episode, $S$ is the number of states, $A$ is the number of actions, and $T$ is the number of episodes. Leveraging this construction, we then derive a regret lower bound of $Ω( (H^2 S K)^{1/3} T^{2/3} )$ for PbMDPs with Borda scores, where $K$ is the number of arms. Next, we develop algorithms that achieve a regret bound of order $T^{2/3}$. We first propose a global optimization approach based on online linear optimization over the set of all occupancy measures, achieving a regret bound of $\tilde{O}((H^2 S^2 K)^{1/3} T^{2/3} )$ under known transitions. However, this approach suffers from suboptimal dependence on the potentially large number of states $S$ and computational inefficiency. To address this, we propose a policy optimization algorithm whose regret is roughly bounded by $\tilde{O}( (H^6 S K^5)^{1/3} T^{2/3} )$ under known transitions, and further extend the result to the unknown-transition setting.

LGJan 24, 2025
Online Inverse Linear Optimization: Efficient Logarithmic-Regret Algorithm, Robustness to Suboptimality, and Lower Bound

Shinsaku Sakaue, Taira Tsuchiya, Han Bao et al.

In online inverse linear optimization, a learner observes time-varying sets of feasible actions and an agent's optimal actions, selected by solving linear optimization over the feasible actions. The learner sequentially makes predictions of the agent's true linear objective function, and their quality is measured by the regret, the cumulative gap between optimal objective values and those achieved by following the learner's predictions. A seminal work by Bärmann et al. (2017) obtained a regret bound of $O(\sqrt{T})$, where $T$ is the time horizon. Subsequently, the regret bound has been improved to $O(n^4 \ln T)$ by Besbes et al. (2021, 2025) and to $O(n \ln T)$ by Gollapudi et al. (2021), where $n$ is the dimension of the ambient space of objective vectors. However, these logarithmic-regret methods are highly inefficient when $T$ is large, as they need to maintain regions specified by $O(T)$ constraints, which represent possible locations of the true objective vector. In this paper, we present the first logarithmic-regret method whose per-round complexity is independent of $T$; indeed, it achieves the best-known bound of $O(n \ln T)$. Our method is strikingly simple: it applies the online Newton step (ONS) to appropriate exp-concave loss functions. Moreover, for the case where the agent's actions are possibly suboptimal, we establish a regret bound of $O(n\ln T + \sqrt{Δ_T n\ln T})$, where $Δ_T$ is the cumulative suboptimality of the agent's actions. This bound is achieved by using MetaGrad, which runs ONS with $Θ(\ln T)$ different learning rates in parallel. We also present a lower bound of $Ω(n)$, showing that the $O(n\ln T)$ bound is tight up to an $O(\ln T)$ factor.

SYFeb 15, 2024
Online Control of Linear Systems under Unbounded Noise

Kaito Ito, Taira Tsuchiya

This paper investigates the problem of controlling a linear system under possibly unbounded stochastic noise with unknown convex cost functions, known as an online control problem. In contrast to the existing work, which assumes the boundedness of noise, we show that an $ \tilde{O}(\sqrt{T}) $ high-probability regret can be achieved under unbounded noise, where $ T $ denotes the time horizon. Notably, the noise is only required to have a finite fourth moment. Moreover, when the costs are strongly convex and the noise is sub-Gaussian, we establish an $ O({\rm poly} (\log T)) $ regret bound.

LGMay 26, 2023
Stability-penalty-adaptive follow-the-regularized-leader: Sparsity, game-dependency, and best-of-both-worlds

Taira Tsuchiya, Shinji Ito, Junya Honda

Adaptivity to the difficulties of a problem is a key property in sequential decision-making problems to broaden the applicability of algorithms. Follow-the-regularized-leader (FTRL) has recently emerged as one of the most promising approaches for obtaining various types of adaptivity in bandit problems. Aiming to further generalize this adaptivity, we develop a generic adaptive learning rate, called stability-penalty-adaptive (SPA) learning rate for FTRL. This learning rate yields a regret bound jointly depending on stability and penalty of the algorithm, into which the regret of FTRL is typically decomposed. With this result, we establish several algorithms with three types of adaptivity: sparsity, game-dependency, and best-of-both-worlds (BOBW). Despite the fact that sparsity appears frequently in real problems, existing sparse multi-armed bandit algorithms with $k$-arms assume that the sparsity level $s \leq k$ is known in advance, which is often not the case in real-world scenarios. To address this issue, we first establish $s$-agnostic algorithms with regret bounds of $\tilde{O}(\sqrt{sT})$ in the adversarial regime for $T$ rounds, which matches the existing lower bound up to a logarithmic factor. Meanwhile, BOBW algorithms aim to achieve a near-optimal regret in both the stochastic and adversarial regimes. Leveraging the SPA learning rate and the technique for $s$-agnostic algorithms combined with a new analysis to bound the variation in FTRL output in response to changes in a regularizer, we establish the first BOBW algorithm with a sparsity-dependent bound. Additionally, we explore partial monitoring and demonstrate that the proposed SPA learning rate framework allows us to achieve a game-dependent bound and the BOBW simultaneously.

MLJun 17, 2020
Analysis and Design of Thompson Sampling for Stochastic Partial Monitoring

Taira Tsuchiya, Junya Honda, Masashi Sugiyama

We investigate finite stochastic partial monitoring, which is a general model for sequential learning with limited feedback. While Thompson sampling is one of the most promising algorithms on a variety of online decision-making problems, its properties for stochastic partial monitoring have not been theoretically investigated, and the existing algorithm relies on a heuristic approximation of the posterior distribution. To mitigate these problems, we present a novel Thompson-sampling-based algorithm, which enables us to exactly sample the target parameter from the posterior distribution. Besides, we prove that the new algorithm achieves the logarithmic problem-dependent expected pseudo-regret $\mathrm{O}(\log T)$ for a linearized variant of the problem with local observability. This result is the first regret bound of Thompson sampling for partial monitoring, which also becomes the first logarithmic regret bound of Thompson sampling for linear bandits.

LGJan 31, 2019
Semi-Supervised Ordinal Regression Based on Empirical Risk Minimization

Taira Tsuchiya, Nontawat Charoenphakdee, Issei Sato et al.

Ordinal regression is aimed at predicting an ordinal class label. In this paper, we consider its semi-supervised formulation, in which we have unlabeled data along with ordinal-labeled data to train an ordinal regressor. There are several metrics to evaluate the performance of ordinal regression, such as the mean absolute error, mean zero-one error, and mean squared error. However, the existing studies do not take the evaluation metric into account, have a restriction on the model choice, and have no theoretical guarantee. To overcome these problems, we propose a novel generic framework for semi-supervised ordinal regression based on the empirical risk minimization principle that is applicable to optimizing all of the metrics mentioned above. Besides, our framework has flexible choices of models, surrogate losses, and optimization algorithms without the common geometric assumption on unlabeled data such as the cluster assumption or manifold assumption. We further provide an estimation error bound to show that our risk estimator is consistent. Finally, we conduct experiments to show the usefulness of our framework.