Ciara Pike-Burke

LG
h-index: 14
19 papers
297 citations
Novelty: 56%
AI Score: 49

19 Papers

LG · Jul 21, 2022
Delayed Feedback in Generalised Linear Bandits Revisited

Benjamin Howson, Ciara Pike-Burke, Sarah Filippi

The stochastic generalised linear bandit is a well-understood model for sequential decision-making problems, with many algorithms achieving near-optimal regret guarantees under immediate feedback. However, the stringent requirement for immediate rewards is unmet in many real-world applications where the reward is almost always delayed. We study the phenomenon of delayed rewards in generalised linear bandits in a theoretical manner. We show that a natural adaptation of an optimistic algorithm to the delayed feedback achieves a regret bound where the penalty for the delays is independent of the horizon. This result significantly improves upon existing work, where the best known regret bound has the delay penalty increasing with the horizon. We verify our theoretical results through experiments on simulated data.

OC · Feb 22, 2023
Optimal Convergence Rate for Exact Policy Mirror Descent in Discounted Markov Decision Processes

Emmeran Johnson, Ciara Pike-Burke, Patrick Rebeschini

Policy Mirror Descent (PMD) is a general family of algorithms that covers a wide range of novel and fundamental methods in reinforcement learning. Motivated by the instability of policy iteration (PI) with inexact policy evaluation, PMD algorithmically regularises the policy improvement step of PI. With exact policy evaluation, PI is known to converge linearly with a rate given by the discount factor $γ$ of a Markov Decision Process. In this work, we bridge the gap between PI and PMD with exact policy evaluation and show that the dimension-free $γ$-rate of PI can be achieved by the general family of unregularised PMD algorithms under an adaptive step-size. We show that both the rate and step-size are unimprovable for PMD: we provide matching lower bounds that demonstrate that the $γ$-rate is optimal for PMD methods as well as PI, and that the adaptive step-size is necessary for PMD to achieve it. Our work is the first to relate PMD to rate-optimality and step-size necessity. Our study of the convergence of PMD avoids the use of the performance difference lemma, which leads to a direct analysis of independent interest. We also extend the analysis to the inexact setting and establish the first dimension-optimal sample complexity for unregularised PMD under a generative model, improving upon the best-known result.
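The $γ$-rate of policy iteration mentioned in the abstract can be checked numerically. The following is a minimal sketch, not the paper's PMD method or its adaptive step-size: it runs plain policy iteration on a small random MDP and records $\|V^\star - V^{\pi_k}\|_\infty$ at each step. The helper names (`random_mdp`, `evaluate`, `greedy`, `policy_iteration_errors`) and the use of repeated Bellman sweeps as a stand-in for exact evaluation are assumptions made for illustration.

```python
import random

def random_mdp(S, A, seed=0):
    """Hypothetical test MDP: random transition rows P[s][a] and rewards r[s][a]."""
    rng = random.Random(seed)
    P = []
    for _ in range(S):
        row = []
        for _ in range(A):
            w = [rng.random() + 1e-3 for _ in range(S)]
            total = sum(w)
            row.append([x / total for x in w])
        P.append(row)
    r = [[rng.random() for _ in range(A)] for _ in range(S)]
    return P, r

def backup(P, r, gamma, V, s, a):
    """One-step lookahead value of action a in state s."""
    return r[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], V))

def evaluate(P, r, gamma, pi, sweeps=2000):
    """Policy evaluation by fixed-point sweeps; for gamma < 1 and this many
    sweeps the result is exact up to floating-point precision."""
    V = [0.0] * len(r)
    for _ in range(sweeps):
        V = [backup(P, r, gamma, V, s, pi[s]) for s in range(len(r))]
    return V

def optimal_value(P, r, gamma, sweeps=2000):
    """Reference V* computed by value iteration."""
    S, A = len(r), len(r[0])
    V = [0.0] * S
    for _ in range(sweeps):
        V = [max(backup(P, r, gamma, V, s, a) for a in range(A)) for s in range(S)]
    return V

def greedy(P, r, gamma, V):
    """Policy improvement step of policy iteration."""
    A = len(r[0])
    return [max(range(A), key=lambda a: backup(P, r, gamma, V, s, a))
            for s in range(len(r))]

def policy_iteration_errors(P, r, gamma, iters=8):
    """Sup-norm distance to V* after each policy-iteration step."""
    V_star = optimal_value(P, r, gamma)
    pi = [0] * len(r)
    errors = []
    for _ in range(iters):
        V = evaluate(P, r, gamma, pi)
        errors.append(max(abs(a - b) for a, b in zip(V_star, V)))
        pi = greedy(P, r, gamma, V)
    return errors
```

Calling `policy_iteration_errors(*random_mdp(5, 3), 0.9)` returns a list in which each entry is at most $γ$ times the previous one, up to rounding; in practice the error usually drops to zero after a few steps because the policy space is finite.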

ML · Feb 1, 2023
Delayed Feedback in Kernel Bandits

Sattar Vakili, Danyal Ahmed, Alberto Bernacchia et al.

Black box optimisation of an unknown function from expensive and noisy evaluations is a ubiquitous problem in machine learning, academic research and industrial production. An abstraction of the problem can be formulated as a kernel-based bandit problem (also known as Bayesian optimisation), where a learner aims at optimising a kernelized function through sequential noisy observations. The existing work predominantly assumes feedback is immediately available, an assumption that fails in many real-world situations, including recommendation systems, clinical trials and hyperparameter tuning. We consider a kernel bandit problem under stochastically delayed feedback, and propose an algorithm with $\tilde{\mathcal{O}}(\sqrt{Γ_k(T)T}+\mathbb{E}[τ])$ regret, where $T$ is the number of time steps, $Γ_k(T)$ is the maximum information gain of the kernel with $T$ observations, and $τ$ is the delay random variable. This represents a significant improvement over the state-of-the-art regret bound of $\tilde{\mathcal{O}}(Γ_k(T)\sqrt{T}+\mathbb{E}[τ]Γ_k(T))$ reported in Verma et al. (2022). In particular, for very non-smooth kernels, the information gain grows almost linearly in time, trivializing the existing results. We also validate our theoretical results with simulations.

ML · Jul 3, 2023
Trading-Off Payments and Accuracy in Online Classification with Paid Stochastic Experts

Dirk van der Hoeven, Ciara Pike-Burke, Hao Qiu et al.

We investigate online classification with paid stochastic experts. Here, before making their prediction, each expert must be paid. The amount that we pay each expert directly influences the accuracy of their prediction through some unknown Lipschitz "productivity" function. In each round, the learner must decide how much to pay each expert and then make a prediction. They incur a cost equal to a weighted sum of the prediction error and upfront payments for all experts. We introduce an online learning algorithm whose total cost after $T$ rounds exceeds that of a predictor which knows the productivity of all experts in advance by at most $\mathcal{O}(K^2(\log T)\sqrt{T})$ where $K$ is the number of experts. In order to achieve this result, we combine Lipschitz bandits and online classification with surrogate losses. These tools allow us to improve upon the bound of order $T^{2/3}$ one would obtain in the standard Lipschitz bandit setting. Our algorithm is empirically evaluated on synthetic data.

LG · Oct 2, 2023
Sample-Efficiency in Multi-Batch Reinforcement Learning: The Need for Dimension-Dependent Adaptivity

Emmeran Johnson, Ciara Pike-Burke, Patrick Rebeschini

We theoretically explore the relationship between sample-efficiency and adaptivity in reinforcement learning. An algorithm is sample-efficient if it uses a number of queries $n$ to the environment that is polynomial in the dimension $d$ of the problem. Adaptivity refers to the frequency at which queries are sent and feedback is processed to update the querying strategy. To investigate this interplay, we employ a learning framework that allows sending queries in $K$ batches, with feedback being processed and queries updated after each batch. This model encompasses the whole adaptivity spectrum, ranging from non-adaptive 'offline' ($K=1$) to fully adaptive ($K=n$) scenarios, and regimes in between. For the problems of policy evaluation and best-policy identification under $d$-dimensional linear function approximation, we establish $Ω(\log \log d)$ lower bounds on the number of batches $K$ required for sample-efficient algorithms with $n = O(\mathrm{poly}(d))$ queries. Our results show that just having adaptivity ($K>1$) does not necessarily guarantee sample-efficiency. Notably, the adaptivity-boundary for sample-efficiency is not between offline reinforcement learning ($K=1$), where sample-efficiency was known to not be possible, and adaptive settings. Instead, the boundary lies between different regimes of adaptivity and depends on the problem dimension.

LG · Nov 1, 2025
Stochastic Shortest Path with Sparse Adversarial Costs

Emmeran Johnson, Alberto Rumi, Ciara Pike-Burke et al.

We study the adversarial Stochastic Shortest Path (SSP) problem with sparse costs under full-information feedback. In the known transition setting, existing bounds based on Online Mirror Descent (OMD) with negative-entropy regularization scale with $\sqrt{\log S A}$, where $SA$ is the size of the state-action space. While we show that this is optimal in the worst-case, this bound fails to capture the benefits of sparsity when only a small number $M \ll SA$ of state-action pairs incur cost. In fact, we also show that the negative-entropy is inherently non-adaptive to sparsity: it provably incurs regret scaling with $\sqrt{\log S}$ on sparse problems. Instead, we propose a family of $\ell_r$-norm regularizers ($r \in (1,2)$) that adapts to the sparsity and achieves regret scaling with $\sqrt{\log M}$ instead of $\sqrt{\log SA}$. We show this is optimal via a matching lower bound, highlighting that $M$ captures the effective dimension of the problem instead of $SA$. Finally, in the unknown transition setting the benefits of sparsity are limited: we prove that even on sparse problems, the minimax regret for any learner scales polynomially with $SA$.

LG · Feb 13, 2025
When and why randomised exploration works (in linear bandits)

Marc Abeille, David Janz, Ciara Pike-Burke

We provide an approach for the analysis of randomised exploration algorithms like Thompson sampling that does not rely on forced optimism or posterior inflation. With this, we demonstrate that in the $d$-dimensional linear bandit setting, when the action space is smooth and strongly convex, randomised exploration algorithms enjoy an $n$-step regret bound of the order $O(d\sqrt{n} \log(n))$. Notably, this shows for the first time that there exist non-trivial linear bandit settings where Thompson sampling can achieve optimal dimension dependence in the regret.

ML · Jan 22, 2025
Fixed-Budget Change Point Identification in Piecewise Constant Bandits

Joseph Lazzaro, Ciara Pike-Burke

We study the piecewise constant bandit problem where the expected reward is a piecewise constant function with one change point (discontinuity) across the action space $[0,1]$ and the learner's aim is to locate the change point. Under the assumption of a fixed exploration budget, we provide the first non-asymptotic analysis of policies designed to locate abrupt changes in the mean reward function under bandit feedback. We study the problem under a large and small budget regime, and for both settings establish lower bounds on the error probability and provide algorithms with near matching upper bounds. Interestingly, our results show a separation in the complexity of the two regimes. We then propose a regime adaptive algorithm which is near optimal for both small and large budgets simultaneously. We complement our theoretical analysis with experimental results in simulated environments to support our findings.

LG · Oct 31, 2024
QuACK: A Multipurpose Queuing Algorithm for Cooperative $k$-Armed Bandits

Benjamin Howson, Sarah Filippi, Ciara Pike-Burke

We study the cooperative stochastic $k$-armed bandit problem, where a network of $m$ agents collaborate to find the optimal action. In contrast to most prior work on this problem, which focuses on extending a specific algorithm to the multi-agent setting, we provide a black-box reduction that allows us to extend any single-agent bandit algorithm to the multi-agent setting. Under mild assumptions on the bandit environment, we prove that our reduction transfers the regret guarantees of the single-agent algorithm to the multi-agent setting. These guarantees are tight in subgaussian environments, in that using a near minimax optimal single-player algorithm is near minimax optimal in the multi-player setting up to an additive graph-dependent quantity. Our reduction and theoretical results are also general, and apply to many different bandit settings. By plugging in appropriate single-player algorithms, we can easily develop provably efficient algorithms for many multi-player settings such as heavy-tailed bandits, duelling bandits and bandits with local differential privacy, among others. Experimentally, our approach is competitive with or outperforms specialised multi-agent algorithms.

LG · Jul 30, 2025
Locally Differentially Private Thresholding Bandits

Annalisa Barbara, Joseph Lazzaro, Ciara Pike-Burke

This work investigates the impact of ensuring local differential privacy in the thresholding bandit problem. We consider both the fixed budget and fixed confidence settings. We propose methods that utilize private responses, obtained through a Bernoulli-based differentially private mechanism, to identify arms with expected rewards exceeding a predefined threshold. We show that this procedure provides strong privacy guarantees and derive theoretical performance bounds on the proposed algorithms. Additionally, we present general lower bounds that characterize the additional loss incurred by any differentially private mechanism, and show that the presented algorithms match these lower bounds up to poly-logarithmic factors. Our results provide valuable insights into privacy-preserving decision-making frameworks in bandit problems.
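To make the idea of a Bernoulli-based private response concrete, here is a minimal sketch of one standard mechanism of this kind: a reward in $[0,1]$ is replaced by a single random bit whose success probability is an affine function of the reward. Whether this is the exact mechanism used in the paper is an assumption; the function names (`prob_one`, `private_response`, `debias`, `private_threshold`) are hypothetical.

```python
import math
import random

def prob_one(reward, epsilon):
    """Probability of releasing a 1 for a reward in [0, 1].
    The value always lies in [1/(e^eps + 1), e^eps/(e^eps + 1)]."""
    e = math.exp(epsilon)
    return (reward * (e - 1.0) + 1.0) / (e + 1.0)

def private_response(reward, epsilon, rng=random):
    """Release one Bernoulli bit instead of the raw reward."""
    return 1 if rng.random() < prob_one(reward, epsilon) else 0

def debias(mean_of_bits, epsilon):
    """Invert the affine map in prob_one to estimate the true mean reward."""
    e = math.exp(epsilon)
    return (mean_of_bits * (e + 1.0) - 1.0) / (e - 1.0)

def private_threshold(threshold, epsilon):
    """Threshold expressed on the scale of the private bits: an arm's mean
    exceeds `threshold` exactly when its mean bit exceeds this value."""
    return prob_one(threshold, epsilon)
```

Because the release probability for any two rewards differs by a factor of at most $e^{\varepsilon}$, each bit satisfies $\varepsilon$-local differential privacy in this sketch, and the affine map preserves the ordering of arm means relative to the threshold.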

ML · Jul 11, 2025
Fixed-Confidence Multiple Change Point Identification under Bandit Feedback

Joseph Lazzaro, Ciara Pike-Burke

Piecewise constant functions describe a variety of real-world phenomena in domains ranging from chemistry to manufacturing. In practice, it is often required to confidently identify the locations of the abrupt changes in these functions as quickly as possible. For this, we introduce a fixed-confidence piecewise constant bandit problem. Here, we sequentially query points in the domain and receive noisy evaluations of the function under bandit feedback. We provide instance-dependent lower bounds for the complexity of change point identification in this problem. These lower bounds illustrate that an optimal method should focus its sampling efforts adjacent to each of the change points, and the number of samples around each change point should be inversely proportional to the magnitude of the change. Building on this, we devise a simple and computationally efficient variant of Track-and-Stop and prove that it is asymptotically optimal in many regimes. We support our theoretical findings with experimental results in synthetic environments demonstrating the efficiency of our method.

LG · Jun 24, 2025
On the necessity of adaptive regularisation: Optimal anytime online learning on $\boldsymbol{\ell_p}$-balls

Emmeran Johnson, David Martínez-Rubio, Ciara Pike-Burke et al.

We study online convex optimization on $\ell_p$-balls in $\mathbb{R}^d$ for $p > 2$. While always sub-linear, the optimal regret exhibits a shift between the high-dimensional setting ($d > T$), when the dimension $d$ is greater than the time horizon $T$, and the low-dimensional setting ($d \leq T$). We show that Follow-the-Regularised-Leader (FTRL) with time-varying regularisation which is adaptive to the dimension regime is anytime optimal for all dimension regimes. Motivated by this, we ask whether it is possible to obtain anytime optimality of FTRL with fixed non-adaptive regularisation. Our main result establishes that for separable regularisers, adaptivity in the regulariser is necessary, and that any fixed regulariser will be sub-optimal in one of the two dimension regimes. Finally, we provide lower bounds which rule out sub-linear regret bounds for the linear bandit problem in sufficiently high-dimension for all $\ell_p$-balls with $p \geq 1$.

LG · Jun 4, 2025
Learning Fair And Effective Points-Based Rewards Programs

Chamsi Hssaine, Yichun Hu, Ciara Pike-Burke

Points-based rewards programs are a prevalent way to incentivize customer loyalty; in these programs, customers who make repeated purchases from a seller accumulate points, working toward eventual redemption of a free reward. These programs have recently come under scrutiny due to accusations of unfair practices in their implementation. Motivated by these concerns, we study the problem of fairly designing points-based rewards programs, with a focus on two obstacles that put fairness at odds with their effectiveness. First, due to customer heterogeneity, the seller should set different redemption thresholds for different customers to generate high revenue. Second, the relationship between customer behavior and the number of accumulated points is typically unknown; this requires experimentation which may unfairly devalue customers' previously earned points. We first show that an individually fair rewards program that uses the same redemption threshold for all customers suffers a loss in revenue of at most a factor of $1+\ln 2$, compared to the optimal personalized strategy that differentiates between customers. We then tackle the problem of designing temporally fair learning algorithms in the presence of demand uncertainty. Toward this goal, we design a learning algorithm that limits the risk of point devaluation due to experimentation by only changing the redemption threshold $O(\log T)$ times, over a horizon of length $T$. This algorithm achieves the optimal (up to polylogarithmic factors) $\widetilde{O}(\sqrt{T})$ regret in expectation. We then modify this algorithm to only ever decrease redemption thresholds, leading to improved fairness at a cost of only a constant factor in regret. Extensive numerical experiments show the limited value of personalization in average-case settings, in addition to demonstrating the strong practical performance of our proposed learning algorithms.

ML · Nov 25, 2021
Bandit problems with fidelity rewards

Gábor Lugosi, Ciara Pike-Burke, Pierre-André Savalle

The fidelity bandits problem is a variant of the $K$-armed bandit problem in which the reward of each arm is augmented by a fidelity reward that provides the player with an additional payoff depending on how 'loyal' the player has been to that arm in the past. We propose two models for fidelity. In the loyalty-points model the amount of extra reward depends on the number of times the arm has previously been played. In the subscription model the additional reward depends on the current number of consecutive draws of the arm. We consider both stochastic and adversarial problems. Since single-arm strategies are not always optimal in stochastic problems, the notion of regret in the adversarial setting needs careful adjustment. We introduce three possible notions of regret and investigate which can be bounded sublinearly. We study in detail the special cases of increasing, decreasing and coupon (where the player gets an additional reward after every $m$ plays of an arm) fidelity rewards. For the models which do not necessarily enjoy sublinear regret, we provide a worst case lower bound. For those models which exhibit sublinear regret, we provide algorithms and bound their regret.
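The two fidelity models differ only in which statistic of the play history drives the extra reward. The sketch below computes both statistics from a list of past arm choices; the function names and the specific coupon rule (a bonus on every $m$-th play of the arm) are illustrative assumptions, not definitions taken from the paper.

```python
def loyalty_count(history, arm):
    """Loyalty-points statistic: how many times `arm` was played before now."""
    return sum(1 for a in history if a == arm)

def subscription_count(history, arm):
    """Subscription statistic: length of the current run of consecutive
    plays of `arm` at the end of the history (0 if the last play differs)."""
    run = 0
    for a in reversed(history):
        if a != arm:
            break
        run += 1
    return run

def coupon_bonus(history, arm, m, bonus=1.0):
    """Hypothetical coupon rule: pay `bonus` when the current play of `arm`
    is its m-th, 2m-th, 3m-th, ... play overall."""
    plays_including_now = loyalty_count(history, arm) + 1
    return bonus if plays_including_now % m == 0 else 0.0
```

For the history `[0, 1, 1, 0, 1, 1]`, arm 1 has a loyalty count of 4 but a subscription count of only 2, since the run was interrupted by a play of arm 0.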

LG · Nov 15, 2021
Optimism and Delays in Episodic Reinforcement Learning

Benjamin Howson, Ciara Pike-Burke, Sarah Filippi

There are many algorithms for regret minimisation in episodic reinforcement learning. This problem is well-understood from a theoretical perspective, provided that the sequences of states, actions and rewards associated with each episode are available to the algorithm updating the policy immediately after every interaction with the environment. However, feedback is almost always delayed in practice. In this paper, we study the impact of delayed feedback in episodic reinforcement learning from a theoretical perspective and propose two general-purpose approaches to handling the delays. The first involves updating as soon as new information becomes available, whereas the second waits before using newly observed information to update the policy. For the class of optimistic algorithms and either approach, we show that the regret increases by an additive term involving the number of states, actions, episode length, the expected delay and an algorithm-dependent constant. We empirically investigate the impact of various delay distributions on the regret of optimistic algorithms to validate our theoretical results.

LG · Oct 15, 2020
Local Differential Privacy for Regret Minimization in Reinforcement Learning

Evrard Garcelon, Vianney Perchet, Ciara Pike-Burke et al.

Reinforcement learning algorithms are widely used in domains where it is desirable to provide a personalized service. In these domains it is common that user data contains sensitive information that needs to be protected from third parties. Motivated by this, we study privacy in the context of finite-horizon Markov Decision Processes (MDPs) by requiring information to be obfuscated on the user side. We formulate this notion of privacy for RL by leveraging the local differential privacy (LDP) framework. We establish a lower bound for regret minimization in finite-horizon MDPs with LDP guarantees which shows that guaranteeing privacy has a multiplicative effect on the regret. This result shows that while LDP is an appealing notion of privacy, it makes the learning problem significantly more complex. Finally, we present an optimistic algorithm that simultaneously satisfies $\varepsilon$-LDP requirements, and achieves $\sqrt{K}/\varepsilon$ regret in any finite-horizon MDP after $K$ episodes, matching the lower bound dependency on the number of episodes $K$.

LG · Jul 3, 2020
A Unifying View of Optimism in Episodic Reinforcement Learning

Gergely Neu, Ciara Pike-Burke

The principle of optimism in the face of uncertainty underpins many theoretically successful reinforcement learning algorithms. In this paper we provide a general framework for designing, analyzing and implementing such algorithms in the episodic reinforcement learning problem. This framework is built upon Lagrangian duality, and demonstrates that every model-optimistic algorithm that constructs an optimistic MDP has an equivalent representation as a value-optimistic dynamic programming algorithm. Typically, it was thought that these two classes of algorithms were distinct, with model-optimistic algorithms benefiting from a cleaner probabilistic analysis while value-optimistic algorithms are easier to implement and thus more practical. With the framework developed in this paper, we show that it is possible to get the best of both worlds by providing a class of algorithms which have a computationally efficient dynamic-programming implementation and also a simple probabilistic analysis. Besides being able to capture many existing algorithms in the tabular setting, our framework can also address large-scale problems under realizable function approximation, where it enables a simple model-based analysis of some recently proposed methods.

ML · Oct 31, 2019
Recovering Bandits

Ciara Pike-Burke, Steffen Grünewälder

We study the recovering bandits problem, a variant of the stochastic multi-armed bandit problem where the expected reward of each arm varies according to some unknown function of the time since the arm was last played. While being a natural extension of the classical bandit problem that arises in many real-world settings, this variation is accompanied by significant difficulties. In particular, methods need to plan ahead and estimate many more quantities than in the classical bandit setting. In this work, we explore the use of Gaussian processes to tackle the estimation and planning problem. We also discuss different regret definitions that let us quantify the performance of the methods. To improve computational efficiency of the methods, we provide an optimistic planning approximation. We complement these discussions with regret bounds and empirical studies.

ML · Sep 20, 2017
Bandits with Delayed, Aggregated Anonymous Feedback

Ciara Pike-Burke, Shipra Agrawal, Csaba Szepesvari et al.

We study a variant of the stochastic $K$-armed bandit problem, which we call "bandits with delayed, aggregated anonymous feedback". In this problem, when the player pulls an arm, a reward is generated; however, it is not immediately observed. Instead, at the end of each round the player observes only the sum of a number of previously generated rewards which happen to arrive in the given round. The rewards are stochastically delayed and, due to the aggregated nature of the observations, the information of which arm led to a particular reward is lost. The question is: what is the cost of the information loss due to this delayed, aggregated anonymous feedback? Previous works have studied bandits with stochastic, non-anonymous delays and found that the regret increases only by an additive factor relating to the expected delay. In this paper, we show that this additive regret increase can be maintained in the harder delayed, aggregated anonymous feedback setting when the expected delay (or a bound on it) is known. We provide an algorithm that matches the worst case regret of the non-anonymous problem exactly when the delays are bounded, and up to logarithmic factors or an additive variance term for unbounded delays.
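The feedback model described in this abstract can be simulated in a few lines. The sketch below is only an illustration of the observation process, not the paper's algorithm: Bernoulli rewards, uniformly distributed integer delays, and the function name `simulate_feedback` are all assumptions chosen for simplicity.

```python
import random

def simulate_feedback(pulls, means, max_delay, seed=0):
    """Simulate delayed, aggregated anonymous feedback.

    pulls:     list of arm indices chosen in rounds 0, 1, ...
    means:     Bernoulli mean of each arm (assumed reward model)
    max_delay: delays are drawn uniformly from {0, ..., max_delay} (assumed)

    Returns (observed, generated), where observed[t] is the only thing the
    player sees in round t and generated holds the hidden per-pull records.
    """
    rng = random.Random(seed)
    horizon = len(pulls)
    observed = [0.0] * horizon
    generated = []
    for t, arm in enumerate(pulls):
        reward = 1.0 if rng.random() < means[arm] else 0.0
        delay = rng.randint(0, max_delay)
        generated.append((arm, reward, delay))
        arrival = t + delay
        if arrival < horizon:
            # Rewards arriving in the same round are summed, so both the
            # originating arm and the original round are lost.
            observed[arrival] += reward
    return observed, generated
```

The player only ever receives `observed`; the `generated` records are returned here purely so the simulation can be inspected, for example to confirm that the observed totals equal the rewards that arrived before the horizon.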