36.3 GT May 12
Risk-Sensitive Online Selection with Bounded Adaptivity
Hossein Nekouyan, Bo Sun, Raouf Boutaba et al.
Designing randomized online algorithms that perform reliably not only in expectation but also under unfavorable realizations of randomness is a fundamental challenge in online decision-making. In this paper, we study this challenge in online adversarial selection, where a decision maker allocates $k$ units of a resource to sequentially arriving buyers through posted prices. We focus on two intertwined considerations that are rarely addressed together: tail-risk sensitivity and bounded adaptivity. Tail risk is measured using conditional value-at-risk (CVaR), and bounded adaptivity limits the number of allowable policy updates over time. Our main contribution is a correlated posted-price mechanism that uses a single random seed to coordinate pricing decisions across time. This correlation induces a monotonic ordering of pricing profiles across sample paths, improving lower-tail performance while respecting the adaptivity constraint. More broadly, our results highlight correlation as a mechanism for controlling tail risk in randomized online algorithms. Using this framework, we derive competitive guarantees for several regimes of the problem under both static and dynamic pricing. Our analysis develops a risk-sensitive randomized online primal-dual framework tailored to CVaR objectives and reveals a systematic trade-off between allowable adaptivity, risk sensitivity, and competitive performance. Experiments on real airline pricing data further illustrate the empirical impact of correlated pricing on welfare concentration and tail behavior.
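For readers unfamiliar with CVaR, the following is a minimal sketch of the standard empirical lower-tail CVaR of a reward sample: the average of the worst $α$-fraction of outcomes. It is not taken from the paper; the function name `cvar_lower`, the parameter `alpha`, and the fractional-weight handling of the boundary sample are illustrative assumptions.

```python
import math


def cvar_lower(samples, alpha):
    """Empirical lower-tail CVaR: mean of the worst alpha-fraction of rewards.

    Hypothetical helper for illustration only; the paper's exact CVaR
    convention may differ from this one.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must lie in (0, 1]")
    if not samples:
        raise ValueError("samples must be non-empty")

    ordered = sorted(samples)          # worst (smallest) rewards first
    n = len(ordered)
    k = alpha * n                      # possibly fractional tail size
    full = min(math.floor(k), n)       # samples counted with full weight

    total = sum(ordered[:full])
    if full < n and k > full:
        # The boundary sample contributes only its fractional share.
        total += (k - full) * ordered[full]
    return total / k


if __name__ == "__main__":
    welfare = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
    print(cvar_lower(welfare, 0.5))   # mean of the five smallest values
    print(cvar_lower(welfare, 1.0))   # alpha = 1 recovers the plain mean
```

With `alpha = 1` the measure reduces to the expectation, and smaller values of `alpha` place more weight on unfavorable realizations, which is the sense in which a CVaR objective is risk sensitive.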
36.3 LG Jun 3
Offline-to-Online Learning in Linear Bandits
Kushagra Chandak, Toshinori Kitamura, Xiaoqi Tan
We study online learning with an additional offline dataset in the stochastic linear bandit setting. Although this problem arises frequently in practice, the offline-to-online tradeoff remains poorly understood in structured environments. We propose a linear bandit algorithm that balances this tradeoff: it relies on offline data during early rounds and increasingly favors exploration as the horizon grows. We establish regret bounds showing that our method is simultaneously competitive with both purely online and purely offline solutions. In particular, its regret relative to the optimal action is sublinear in the number of online interactions, while its regret relative to an offline reference decreases as the number of offline samples grows. Empirical results further demonstrate its effectiveness across various problem parameters.
AI Oct 24, 2025
Computational Hardness of Reinforcement Learning with Partial $q^π$-Realizability
Shayan Karimi, Xiaoqi Tan
This paper investigates the computational complexity of reinforcement learning in a novel linear function approximation regime, termed partial $q^π$-realizability. In this framework, the objective is to learn an $ε$-optimal policy with respect to a predefined policy set $Π$, under the assumption that all value functions for policies in $Π$ are linearly realizable. The assumptions of this framework are weaker than those in $q^π$-realizability but stronger than those in $q^*$-realizability, providing a practical model where function approximation naturally arises. We prove that learning an $ε$-optimal policy in this setting is computationally hard. Specifically, we establish NP-hardness under a parameterized greedy policy set (argmax) and show that, unless NP = RP, an exponential lower bound (in the feature vector dimension) holds when the policy set contains softmax policies, under the Randomized Exponential Time Hypothesis. Our hardness results mirror those in $q^*$-realizability and suggest that computational difficulty persists even when $Π$ is expanded beyond the optimal policy. To establish this, we reduce from two complexity problems, $δ$-Max-3SAT and $δ$-Max-3SAT(b), to instances of GLinear-$κ$-RL (greedy policy) and SLinear-$κ$-RL (softmax policy). Our findings indicate that positive computational results are generally unattainable in partial $q^π$-realizability, in contrast to $q^π$-realizability under a generative access model.
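To make the two policy classes concrete, here is a generic sketch of a greedy (argmax) policy and a softmax policy defined over linear action-value estimates $q(s, a) = θ^\top φ(s, a)$. The abstract does not give the paper's exact parameterization, so the names `theta`, `features`, and `temperature` and the specific form of the policies are assumptions for illustration only.

```python
import math


def q_values(theta, features):
    """Linear action values: one dot product theta . phi(s, a) per action."""
    return [sum(t * f for t, f in zip(theta, phi)) for phi in features]


def greedy_action(theta, features):
    """Greedy (argmax) policy: index of the action with the largest linear value."""
    q = q_values(theta, features)
    return max(range(len(q)), key=q.__getitem__)


def softmax_policy(theta, features, temperature=1.0):
    """Softmax policy: action probabilities proportional to exp(q / temperature)."""
    q = q_values(theta, features)
    top = max(q)  # subtract the maximum for numerical stability
    weights = [math.exp((v - top) / temperature) for v in q]
    normalizer = sum(weights)
    return [w / normalizer for w in weights]


if __name__ == "__main__":
    theta = [1.0, 0.0]                                   # hypothetical parameter vector
    features = [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]      # phi(s, a) for three actions
    print(greedy_action(theta, features))
    print(softmax_policy(theta, features))
```

The greedy policy is deterministic, whereas the softmax policy spreads probability over all actions and concentrates on the greedy action as `temperature` shrinks.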
LG Oct 3, 2025
Trajectory Data Suffices for Statistically Efficient Policy Evaluation in Finite-Horizon Offline RL with Linear $q^π$-Realizability and Concentrability
Volodymyr Tkachuk, Csaba Szepesvári, Xiaoqi Tan
We study finite-horizon offline reinforcement learning (RL) with function approximation for both policy evaluation and policy optimization. Prior work established that statistically efficient learning is impossible for either of these problems when the only assumptions are that the data has good coverage (concentrability) and the state-action value function of every policy is linearly realizable ($q^π$-realizability) (Foster et al., 2021). Recently, Tkachuk et al. (2024) gave a statistically efficient learner for policy optimization, if in addition the data is assumed to be given as trajectories. In this work we present a statistically efficient learner for policy evaluation under the same assumptions. Further, we show that the sample complexity of the learner used by Tkachuk et al. (2024) for policy optimization can be improved by a tighter analysis.
DS Feb 6, 2025
Knowing When to Stop Matters: A Unified Algorithm for Online Conversion under Horizon Uncertainty
Yanzhao Wang, Hasti Nourmohammadi Sigaroudi, Bo Sun et al.
This paper investigates the online conversion problem, which involves sequentially trading a divisible resource (e.g., energy) under dynamically changing prices to maximize profit. A key challenge in online conversion is managing decisions under horizon uncertainty, where the duration of trading is either known, revealed partway, or entirely unknown. We propose a unified algorithm that achieves optimal competitive guarantees across these horizon models, accounting for practical constraints such as box constraints, which limit the maximum allowable trade per step. Additionally, we extend the algorithm to a learning-augmented version that leverages horizon predictions to adaptively balance performance, achieving near-optimal results when predictions are accurate while maintaining strong guarantees when predictions are unreliable. These results advance the understanding of online conversion under various degrees of horizon uncertainty and provide more practical strategies for addressing real-world constraints.