Hirota Kinoshita

h-index16
2papers

2 Papers

LGFeb 2
A Relative-Budget Theory for Reinforcement Learning with Verifiable Rewards in Large Language Model Reasoning

Akifumi Wachi, Hirota Kinoshita, Shokichi Takakura et al.

Reinforcement learning (RL) is a dominant paradigm for improving the reasoning abilities of large language models, yet its effectiveness varies across tasks and compute budgets. We propose a \emph{relative-budget} theory explaining this variation through a single quantity called relative budget $ξ:= H/\mathbb{E}[T]$, where $H$ is the generation horizon (token budget) and $T$ denotes the number of tokens until the first correct solution under a base policy. We show that $ξ$ determines sample efficiency by controlling reward variance and the likelihood of informative trajectories. Our analysis reveals three regimes: in the \emph{deficient} regime ($ξ\to 0$), informative trajectories are rare and the sample complexity explodes; in the \emph{balanced} regime ($ξ=Θ(1)$), informative trajectories occur with non-negligible probability and RL is maximally sample-efficient; and in the \emph{ample} regime ($ξ\to \infty$), learning remains stable but marginal gains per iteration diminish. We further provide finite-sample guarantees for online RL that characterize learning progress across these regimes. Specifically, in a case study under idealized distributional assumptions, we show that the relative budget grows linearly over iterations. Our empirical results confirm these predictions in realistic settings, identifying a budget $ξ\in [1.5, 2.0]$ that maximizes learning efficiency and coincides with peak reasoning performance.

GTNov 30, 2024
Achieving PAC Guarantees in Mechanism Design through Multi-Armed Bandits

Takayuki Osogami, Hirota Kinoshita, Segev Wasserkrug

We analytically derive a class of optimal solutions to a linear program (LP) for automated mechanism design that satisfies efficiency, incentive compatibility, strong budget balance (SBB), and individual rationality (IR), where SBB and IR are enforced in expectation. These solutions can be expressed using a set of essential variables whose cardinality is exponentially smaller than the total number of variables in the original formulation. However, evaluating a key term in the solutions requires exponentially many optimization steps as the number of players $N$ increases. We address this by translating the evaluation of this term into a multi-armed bandit (MAB) problem and develop a probably approximately correct (PAC) estimator with asymptotically optimal sample complexity. This MAB-based approach reduces the optimization complexity from exponential to $O(N\log N)$. Numerical experiments confirm that our method efficiently computes mechanisms with the target properties, scaling to problems with up to $N=128$ players -- substantially improving over prior work.