SY Oct 24, 2012
Distributed Optimization: Convergence Conditions from a Dynamical System Perspective
Guodong Shi, Alexandre Proutiere, Karl Henrik Johansson
This paper explores the fundamental properties of distributed minimization of a sum of functions with each function only known to one node, and a pre-specified level of node knowledge and computational capacity. We define the optimization information each node receives from its objective function, the neighboring information each node receives from its neighbors, and the computational capacity each node can take advantage of in controlling its state. It is proven that there exist a neighboring information way and a control law that guarantee global optimal consensus if and only if the solution sets of the local objective functions admit a nonempty intersection set for fixed strongly connected graphs. Then we show that for any tolerated error, we can find a control law that guarantees global optimal consensus within this error for fixed, bidirectional, and connected graphs under mild conditions. For time-varying graphs, we show that optimal consensus can always be achieved as long as the graph is uniformly jointly strongly connected and the nonempty intersection condition holds. The results illustrate that nonempty intersection for the local optimal solution sets is a critical condition for successful distributed optimization for a large class of algorithms.
LG Apr 5, 2023
Conformal Off-Policy Evaluation in Markov Decision Processes
Daniele Foffano, Alessio Russo, Alexandre Proutiere
Reinforcement Learning aims at identifying and evaluating efficient control policies from data. In many real-world applications, the learner is not allowed to experiment and cannot gather data in an online manner (this is the case when experimenting is expensive, risky or unethical). For such applications, the reward of a given policy (the target policy) must be estimated using historical data gathered under a different policy (the behavior policy). Most methods for this learning task, referred to as Off-Policy Evaluation (OPE), do not come with accuracy and certainty guarantees. We present a novel OPE method based on Conformal Prediction that outputs an interval containing the true reward of the target policy with a prescribed level of certainty. The main challenge in OPE stems from the distribution shift due to the discrepancies between the target and the behavior policies. We propose and empirically evaluate different ways to deal with this shift. Some of these methods yield conformalized intervals with reduced length compared to existing approaches, while maintaining the same certainty level.
NI Apr 14, 2022
Measurement-based Admission Control in Sliced Networks: A Best Arm Identification Approach
Simon Lindståhl, Alexandre Proutiere, Andreas Johnsson
In sliced networks, the shared tenancy of slices requires adaptive admission control of data flows, based on measurements of network resources. In this paper, we investigate the design of measurement-based admission control schemes, deciding whether a new data flow can be admitted and, if so, on which slice. The objective is to devise a joint measurement and decision strategy that returns a correct decision (e.g., the least loaded slice) with a certain level of confidence while minimizing the measurement cost (the number of measurements made before committing to the decision). We study the design of such strategies for several natural admission criteria specifying what a correct decision is. For each of these criteria, using tools from best arm identification in bandits, we first derive an explicit information-theoretical lower bound on the cost of any algorithm returning the correct decision with fixed confidence. We then devise a joint measurement and decision strategy achieving this theoretical limit. We empirically compare the measurement costs of these strategies with both the lower bounds and a naive measurement scheme. We find that our algorithm significantly outperforms the naive scheme (by a factor of 2 to 8).
ML Nov 28, 2022
On the Sample Complexity of Representation Learning in Multi-task Bandits with Global and Local structure
Alessio Russo, Alexandre Proutiere
We investigate the sample complexity of learning the optimal arm for multi-task bandit problems. Arms consist of two components: one that is shared across tasks (that we call representation) and one that is task-specific (that we call predictor). The objective is to learn the optimal (representation, predictor)-pair for each task, under the assumption that the optimal representation is common to all tasks. Within this framework, efficient learning algorithms should transfer knowledge across tasks. We consider the best-arm identification problem for a fixed confidence, where, in each round, the learner actively selects both a task, and an arm, and observes the corresponding reward. We derive instance-specific sample complexity lower bounds satisfied by any $(δ_G,δ_H)$-PAC algorithm (such an algorithm identifies the best representation with probability at least $1-δ_G$, and the best predictor for a task with probability at least $1-δ_H$). We devise an algorithm OSRL-SC whose sample complexity approaches the lower bound, and scales at most as $H(G\log(1/δ_G)+ X\log(1/δ_H))$, with $X,G,H$ being, respectively, the number of tasks, representations and predictors. By comparison, this scaling is significantly better than the classical best-arm identification algorithm that scales as $HGX\log(1/δ)$.
ML Aug 23, 2023
On Universally Optimal Algorithms for A/B Testing
Po-An Wang, Kaito Ariu, Alexandre Proutiere
We study the problem of best-arm identification with fixed budget in stochastic multi-armed bandits with Bernoulli rewards. For the problem with two arms, also known as the A/B testing problem, we prove that there is no algorithm that (i) performs as well as the algorithm sampling each arm equally (referred to as the uniform sampling algorithm) in all instances, and that (ii) strictly outperforms uniform sampling on at least one instance. In short, there is no algorithm better than the uniform sampling algorithm. To establish this result, we first introduce the natural class of consistent and stable algorithms, and show that any algorithm that performs as well as the uniform sampling algorithm in all instances belongs to this class. The proof then proceeds by deriving a lower bound on the error rate satisfied by any consistent and stable algorithm, and by showing that the uniform sampling algorithm matches this lower bound. Our results provide a solution to the two open problems presented by Qin (2022). For the general problem with more than two arms, we provide a first set of results. We characterize the asymptotic error rate of the celebrated Successive Rejects (SR) algorithm (Audibert et al., 2010) and show that, surprisingly, the uniform sampling algorithm outperforms the SR algorithm in some instances.
LG Aug 11, 2022
Best Policy Identification in Linear MDPs
Jerome Taupin, Yassir Jedra, Alexandre Proutiere
We investigate the problem of best policy identification in discounted linear Markov Decision Processes in the fixed confidence setting under a generative model. We first derive an instance-specific lower bound on the expected number of samples required to identify an $\varepsilon$-optimal policy with probability $1-δ$. The lower bound characterizes the optimal sampling rule as the solution of an intricate non-convex optimization program, but can be used as the starting point to devise simple and near-optimal sampling rules and algorithms. We devise such algorithms. One of these exhibits a sample complexity upper bounded by ${\cal O}({\frac{d}{(\varepsilon+Δ)^2}} (\log(\frac{1}δ)+d))$ where $Δ$ denotes the minimum reward gap of sub-optimal actions and $d$ is the dimension of the feature space. This upper bound holds in the moderate-confidence regime (i.e., for all $δ$), and matches existing minimax and gap-dependent lower bounds. We extend our algorithm to episodic linear MDPs.
IT May 11
Near-optimal Rank Adaptive Inference of High Dimensional Matrices
Frédéric Zheng, Yassir Jedra, Alexandre Proutiere
We address the problem of estimating a high-dimensional matrix from linear measurements, with a focus on designing optimal rank-adaptive algorithms. These algorithms infer the matrix by estimating its singular values and the corresponding singular vectors up to an effective rank, adaptively determined based on the data. We establish instance-specific lower bounds for the sample complexity of such algorithms, uncovering fundamental trade-offs in selecting the effective rank: balancing the precision of estimating a subset of singular values against the approximation cost incurred for the remaining ones. Our analysis identifies how the optimal effective rank depends on the matrix being estimated, the sample size, and the noise level. We propose an algorithm that combines a Least-Squares estimator with a universal singular value thresholding procedure. We provide finite-sample error bounds for this algorithm and demonstrate that its performance nearly matches the derived fundamental limits. Our results rely on an enhanced analysis of matrix denoising methods based on singular value thresholding. We validate our findings with applications to multivariate regression and linear dynamical system identification.
SI Jun 18, 2023
Revisiting Instance-Optimal Cluster Recovery in the Labeled Stochastic Block Model
Kaito Ariu, Alexandre Proutiere, Se-Young Yun
In this paper, we investigate the problem of recovering hidden communities in the Labeled Stochastic Block Model (LSBM) with a finite number of clusters whose sizes grow linearly with the total number of nodes. We derive the necessary and sufficient conditions under which the expected number of misclassified nodes is less than $ s $, for any number $ s = o(n) $. To achieve this, we propose IAC (Instance-Adaptive Clustering), the first algorithm whose performance matches the instance-specific lower bounds both in expectation and with high probability. IAC is a novel two-phase algorithm that consists of a one-shot spectral clustering step followed by iterative likelihood-based cluster assignment improvements. This approach is based on the instance-specific lower bound and notably does not require any knowledge of the model parameters, including the number of clusters. By performing the spectral clustering only once, IAC maintains an overall computational complexity of $ \mathcal{O}(n\, \text{polylog}(n)) $, making it scalable and practical for large-scale problems.
LG Oct 10, 2023
Spectral Entry-wise Matrix Estimation for Low-Rank Reinforcement Learning
Stefan Stojanovic, Yassir Jedra, Alexandre Proutiere
We study matrix estimation problems arising in reinforcement learning (RL) with low-rank structure. In low-rank bandits, the matrix to be recovered specifies the expected arm rewards, and for low-rank Markov Decision Processes (MDPs), it may for example characterize the transition kernel of the MDP. In both cases, each entry of the matrix carries important information, and we seek estimation methods with low entry-wise error. Importantly, these methods further need to accommodate for inherent correlations in the available data (e.g. for MDPs, the data consists of system trajectories). We investigate the performance of simple spectral-based matrix estimation approaches: we show that they efficiently recover the singular subspaces of the matrix and exhibit nearly-minimal entry-wise error. These new results on low-rank matrix estimation make it possible to devise reinforcement learning algorithms that fully exploit the underlying low-rank structure. We provide two examples of such algorithms: a regret minimization algorithm for low-rank bandit problems, and a best policy identification algorithm for reward-free RL in low-rank MDPs. Both algorithms yield state-of-the-art performance guarantees.
LG May 13
Switching Successor Measures for Hierarchical Zero-shot Reinforcement Learning
Stefan Stojanovic, Alexandre Proutiere
Hierarchical reinforcement learning can improve generalization by decomposing long-horizon decision-making into simpler subproblems. However, existing approaches often rely on restrictive design choices, such as fixed temporal abstractions or goal-conditioned objectives, which largely confine them to goal-reaching tasks and limit their applicability to general reward functions. In this paper, we introduce switching successor measures, an extension of successor measures that enables hierarchical control in zero-shot reinforcement learning without additional supervision, fixed horizons, or manually designed subgoals. We show that switching successor measures arise naturally from classical successor measures while preserving their underlying structure. Building on this result, we propose FB $π$-Switch, an algorithm that extracts both a high-level subgoal-selection policy and a low-level control policy directly from forward-backward (FB) representations, allowing hierarchical behavior to emerge from a single learned representation. Experiments on both goal-conditioned and general reward-based tasks show that FB $π$-Switch improves over non-hierarchical baselines and matches state-of-the-art hierarchical methods in goal-conditioned settings. These results demonstrate that structured successor representations provide a flexible foundation for hierarchical zero-shot reinforcement learning beyond goal-reaching tasks. Our project website is available at: https://stestokth.github.io/switching-successors/.
SY Oct 7, 2023
Sub-linear Regret in Adaptive Model Predictive Control
Damianos Tranos, Alexandre Proutiere
We consider the problem of adaptive Model Predictive Control (MPC) for uncertain linear systems with additive disturbances and with state and input constraints. We present STT-MPC (Self-Tuning Tube-based Model Predictive Control), an online algorithm that combines the certainty-equivalence principle and polytopic tubes. Specifically, at any given step, STT-MPC infers the system dynamics using the Least Squares Estimator (LSE), and applies a controller obtained by solving an MPC problem using these estimates. Polytopic tubes ensure that, despite the uncertainties, state and input constraints are satisfied and that recursive feasibility and asymptotic stability hold. In this work, we analyze the regret of the algorithm compared to an oracle algorithm initially aware of the system dynamics. We establish that the expected regret of STT-MPC does not exceed $O(T^{1/2 + ε})$, where $ε\in (0,1)$ is a design parameter tuning the persistent excitation component of the algorithm. Our result relies on a recently proposed exponential decay of sensitivity property and, to the best of our knowledge, is the first of its kind in this setting. We illustrate the performance of our algorithm using a simple numerical example.
LG Jul 21, 2024
Conformal Predictions under Markovian Data
Frédéric Zheng, Alexandre Proutiere
We study the split Conformal Prediction method when applied to Markovian data. We quantify the gap in terms of coverage induced by the correlations in the data (compared to exchangeable data). This gap strongly depends on the mixing properties of the underlying Markov chain, and we prove that it typically scales as $\sqrt{t_\mathrm{mix}\ln(n)/n}$ (where $t_\mathrm{mix}$ is the mixing time of the chain). We also derive upper bounds on the impact of the correlations on the size of the prediction set. We then present $K$-split CP, a method that thins the calibration dataset and adapts to the mixing properties of the chain. Its coverage gap is reduced to $t_\mathrm{mix}/(n\ln(n))$ without significantly affecting the size of the prediction set. Finally, we test our algorithms on synthetic and real-world datasets.
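As a rough illustration of split conformal calibration with thinning, the sketch below computes the conformal threshold from a list of calibration scores and from a thinned subset of it. The function names and the keep-every-K-th thinning rule are assumptions made for illustration; they are not necessarily the $K$-split CP procedure analyzed in the paper.

```python
import math


def conformal_threshold(scores, alpha):
    """Split-CP threshold: the ceil((n + 1) * (1 - alpha))-th smallest score.

    Returns infinity when the calibration set is too small for the level.
    """
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(scores)[k - 1]


def thin(scores, K):
    """Keep every K-th calibration score (an assumed, simple thinning rule).

    Spacing the retained points out is meant to weaken their dependence
    when the scores come from a Markov chain.
    """
    return scores[::K]


def prediction_interval(point_prediction, threshold):
    """Interval for absolute-residual scores |y - prediction|."""
    return (point_prediction - threshold, point_prediction + threshold)
```

A larger `K` leaves fewer calibration points, so the threshold is computed from a smaller sample; this is the trade-off between dependence and sample size that thinning has to balance.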
AI Apr 10
Advantage-Guided Diffusion for Model-Based Reinforcement Learning
Daniele Foffano, Arvid Eriksson, David Broman et al.
Model-based reinforcement learning (MBRL) with autoregressive world models suffers from compounding errors, whereas diffusion world models mitigate this by generating trajectory segments jointly. However, existing diffusion guides are either policy-only, discarding value information, or reward-based, which becomes myopic when the diffusion horizon is short. We introduce Advantage-Guided Diffusion for MBRL (AGD-MBRL), which steers the reverse diffusion process using the agent's advantage estimates so that sampling concentrates on trajectories expected to yield higher long-term return beyond the generated window. We develop two guides: (i) Sigmoid Advantage Guidance (SAG) and (ii) Exponential Advantage Guidance (EAG). We prove that a diffusion model guided through SAG or EAG allows us to perform reweighted sampling of trajectories with weights increasing in state-action advantage, implying policy improvement under standard assumptions. Additionally, we show that the trajectories generated from AGD-MBRL follow an improved policy (that is, with higher value) compared to an unguided diffusion model. AGD integrates seamlessly with PolyGRAD-style architectures by guiding the state components while leaving action generation policy-conditioned, and requires no change to the diffusion training objective. On MuJoCo control tasks (HalfCheetah, Hopper, Walker2D and Reacher), AGD-MBRL improves sample efficiency and final return over PolyGRAD, an online Diffuser-style reward guide, and model-free baselines (PPO/TRPO), in some cases by a margin of 2x. These results show that advantage-aware guidance is a simple, effective remedy for short-horizon myopia in diffusion-model MBRL.
OC Apr 7
Optimal Centered Active Excitation in Linear System Identification
Kaito Ito, Alexandre Proutiere
We propose an active learning algorithm for linear system identification with optimal centered noise excitation. Notably, our algorithm, based on ordinary least squares and semidefinite programming, attains the minimal sample complexity while allowing for efficient computation of an estimate of a system matrix. More specifically, we first establish lower bounds of the sample complexity for any active learning algorithm to attain the prescribed accuracy and confidence levels. Next, we derive a sample complexity upper bound of the proposed algorithm, which matches the lower bound for any algorithm up to universal factors. Our tight bounds are easy to interpret and explicitly show their dependence on the system parameters such as the state dimension.
LG Dec 19, 2023
Best Arm Identification with Fixed Budget: A Large Deviation Perspective
Po-An Wang, Ruo-Chun Tzeng, Alexandre Proutiere
We consider the problem of identifying the best arm in stochastic Multi-Armed Bandits (MABs) using a fixed sampling budget. Characterizing the minimal instance-specific error probability for this problem constitutes one of the important remaining open problems in MABs. When arms are selected using a static sampling strategy, the error probability decays exponentially with the number of samples at a rate that can be explicitly derived via Large Deviation techniques. Analyzing the performance of algorithms with adaptive sampling strategies is however much more challenging. In this paper, we establish a connection between the Large Deviation Principle (LDP) satisfied by the empirical proportions of arm draws and that satisfied by the empirical arm rewards. This connection holds for any adaptive algorithm, and is leveraged (i) to improve error probability upper bounds of some existing algorithms, such as the celebrated Successive Rejects algorithm (Audibert et al., 2010), and (ii) to devise and analyze new algorithms. In particular, we present Continuous Rejects, a truly adaptive algorithm that can reject arms in any round based on the observed empirical gaps between the rewards of various arms. Applying our Large Deviation results, we prove that Continuous Rejects enjoys better performance guarantees than existing algorithms, including Successive Rejects. Extensive numerical experiments confirm this observation.
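For readers unfamiliar with the Successive Rejects baseline mentioned above, the following is a minimal sketch of its phase schedule as it is commonly stated: the budget is split into K - 1 phases, every surviving arm is pulled equally often in each phase, and the arm with the lowest empirical mean is dropped at the end of the phase. The `pull` callback and the function name are hypothetical, and this is not the Continuous Rejects algorithm proposed in the paper.

```python
import math


def successive_rejects(pull, num_arms, budget):
    """Return the index of the arm that survives all rejection phases.

    pull(arm) is a hypothetical callback returning one reward sample.
    """
    if num_arms < 2 or budget < 2 * num_arms:
        raise ValueError("need at least 2 arms and a budget of 2 pulls per arm")
    # Normalizing constant of the phase lengths: 1/2 + sum_{i=2}^{K} 1/i.
    log_bar = 0.5 + sum(1.0 / i for i in range(2, num_arms + 1))
    active = list(range(num_arms))
    sums = [0.0] * num_arms
    counts = [0] * num_arms
    previous = 0
    for phase in range(1, num_arms):
        # Cumulative number of pulls per surviving arm by the end of this phase.
        n_k = math.ceil((budget - num_arms) / (log_bar * (num_arms + 1 - phase)))
        for arm in active:
            for _ in range(n_k - previous):
                sums[arm] += pull(arm)
                counts[arm] += 1
        previous = n_k
        # Reject the surviving arm with the lowest empirical mean.
        worst = min(active, key=lambda a: sums[a] / counts[a])
        active.remove(worst)
    return active[0]
```

Rejections here can only happen at the pre-scheduled phase boundaries, which is the rigidity that an algorithm able to reject arms in any round removes.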
AI Apr 6
Receding-Horizon Control via Drifting Models
Daniele Foffano, Alessio Russo, Alexandre Proutiere
We study the problem of trajectory optimization in settings where the system dynamics are unknown and it is not possible to simulate trajectories through a surrogate model. When an offline dataset of trajectories is available, an agent could directly learn a trajectory generator by distribution matching. However, this approach only recovers the behavior distribution in the dataset, and does not in general produce a model that minimizes a desired cost criterion. In this work, we propose Drifting MPC, an offline trajectory optimization framework that combines drifting generative models with receding-horizon planning under unknown dynamics. The goal of Drifting MPC is to learn, from an offline dataset of trajectories, a conditional distribution over trajectories that is both supported by the data and biased toward optimal plans. We show that the resulting distribution learned by Drifting MPC is the unique solution of an objective that trades off optimality with closeness to the offline prior. Empirically, we show that Drifting MPC can generate near-optimal trajectories while retaining the one-step inference efficiency of drifting models and substantially reducing generation time relative to diffusion-based baselines.
LG Feb 24, 2024
Low-Rank Bandits via Tight Two-to-Infinity Singular Subspace Recovery
Yassir Jedra, William Réveillard, Stefan Stojanovic et al.
We study contextual bandits with low-rank structure where, in each round, if the (context, arm) pair $(i,j)\in [m]\times [n]$ is selected, the learner observes a noisy sample of the $(i,j)$-th entry of an unknown low-rank reward matrix. Successive contexts are generated randomly in an i.i.d. manner and are revealed to the learner. For such bandits, we present efficient algorithms for policy evaluation, best policy identification and regret minimization. For policy evaluation and best policy identification, we show that our algorithms are nearly minimax optimal. For instance, the number of samples required to return an $\varepsilon$-optimal policy with probability at least $1-δ$ typically scales as ${r(m+n)\over \varepsilon^2}\log(1/δ)$. Our regret minimization algorithm enjoys minimax guarantees typically scaling as $r^{7/4}(m+n)^{3/4}\sqrt{T}$, which improves over existing algorithms. All the proposed algorithms consist of two phases: they first leverage spectral methods to estimate the left and right singular subspaces of the low-rank reward matrix. We show that these estimates enjoy tight error guarantees in the two-to-infinity norm. This in turn allows us to reformulate our problems as a misspecified linear bandit problem with dimension roughly $r(m+n)$ and misspecification controlled by the subspace recovery error, as well as to design the second phase of our algorithms efficiently.
ML Oct 27, 2025
Minimizing Human Intervention in Online Classification
William Réveillard, Vasileios Saketos, Alexandre Proutiere et al.
We introduce and study an online problem arising in question answering systems. In this problem, an agent must sequentially classify user-submitted queries represented by $d$-dimensional embeddings drawn i.i.d. from an unknown distribution. The agent may consult a costly human expert for the correct label, or guess on her own without receiving feedback. The goal is to minimize regret against an oracle with free expert access. When the time horizon $T$ is at least exponential in the embedding dimension $d$, one can learn the geometry of the class regions: in this regime, we propose the Conservative Hull-based Classifier (CHC), which maintains convex hulls of expert-labeled queries and calls the expert as soon as a query lands outside all known hulls. CHC attains $\mathcal{O}(\log^d T)$ regret in $T$ and is minimax optimal for $d=1$. Otherwise, the geometry cannot be reliably learned without additional distributional assumptions. We show that when the queries are drawn from a subgaussian mixture, for $T \le e^d$, a Center-based Classifier (CC) achieves regret proportional to $N\log{N}$ where $N$ is the number of labels. To bridge these regimes, we introduce the Generalized Hull-based Classifier (GHC), a practical extension of CHC that allows for more aggressive guessing via a tunable threshold parameter. Our approach is validated with experiments, notably on real-world question-answering datasets using embeddings derived from state-of-the-art large language models.
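The hull-based rule described above can be sketched in one dimension, where the convex hull of the labeled queries of a class is simply an interval. The class name, the `expert` callback, and the first-match tie-breaking are assumptions made for illustration; the paper's CHC operates on $d$-dimensional embeddings.

```python
class IntervalHullClassifier:
    """1-D sketch: guess when the query lies in a known hull, else ask the expert."""

    def __init__(self):
        self.hulls = {}        # label -> (low, high) interval of expert-labeled queries
        self.expert_calls = 0  # number of costly expert consultations so far

    def classify(self, x, expert):
        # Guess without feedback if the query falls inside a known hull.
        for label, (low, high) in self.hulls.items():
            if low <= x <= high:
                return label
        # Otherwise consult the expert and grow the hull of the returned label.
        label = expert(x)
        self.expert_calls += 1
        low, high = self.hulls.get(label, (x, x))
        self.hulls[label] = (min(low, x), max(high, x))
        return label
```

When each class region is itself an interval, a guess made inside a hull is always correct, so all of the cost comes from the expert calls made while the hulls are still growing.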
LG Sep 28, 2025
Adversarial Diffusion for Robust Reinforcement Learning
Daniele Foffano, Alessio Russo, Alexandre Proutiere
Robustness to modeling errors and uncertainties remains a central challenge in reinforcement learning (RL). In this work, we address this challenge by leveraging diffusion models to train robust RL policies. Diffusion models have recently gained popularity in model-based RL due to their ability to generate full trajectories "all at once", mitigating the compounding errors typical of step-by-step transition models. Moreover, they can be conditioned to sample from specific distributions, making them highly flexible. We leverage conditional sampling to learn policies that are robust to uncertainty in environment dynamics. Building on the established connection between Conditional Value at Risk (CVaR) optimization and robust RL, we introduce Adversarial Diffusion for Robust Reinforcement Learning (AD-RRL). AD-RRL guides the diffusion process to generate worst-case trajectories during training, effectively optimizing the CVaR of the cumulative return. Empirical results across standard benchmarks show that AD-RRL achieves superior robustness and performance compared to existing robust RL methods.
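To make the CVaR objective concrete, here is a small sketch of an empirical CVaR estimate over sampled cumulative returns, taking the average of the worst `alpha` fraction of returns. The function name and the lower-tail convention (low return is bad) are assumptions for illustration; the paper's guidance mechanism itself is not reproduced here.

```python
import math


def empirical_cvar(returns, alpha):
    """Average of the lowest ceil(alpha * n) returns, with 0 < alpha <= 1."""
    if not returns or not 0 < alpha <= 1:
        raise ValueError("need a non-empty sample and alpha in (0, 1]")
    k = max(1, math.ceil(alpha * len(returns)))
    worst = sorted(returns)[:k]  # the k smallest cumulative returns
    return sum(worst) / k
```

With `alpha = 1` this reduces to the ordinary mean return, while small values of `alpha` focus on the worst trajectories, which is the sense in which maximizing CVaR favors robust policies.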
LG Jun 17, 2025
Adaptive Reinforcement Learning for Unobservable Random Delays
John Wikman, Alexandre Proutiere, David Broman
In standard Reinforcement Learning (RL) settings, the interaction between the agent and the environment is typically modeled as a Markov Decision Process (MDP), which assumes that the agent observes the system state instantaneously, selects an action without delay, and executes it immediately. In real-world dynamic environments, such as cyber-physical systems, this assumption often breaks down due to delays in the interaction between the agent and the system. These delays can vary stochastically over time and are typically unobservable, meaning they are unknown when deciding on an action. Existing methods deal with this uncertainty conservatively by assuming a known fixed upper bound on the delay, even if the delay is often much lower. In this work, we introduce the interaction layer, a general framework that enables agents to adaptively and seamlessly handle unobservable and time-varying delays. Specifically, the agent generates a matrix of possible future actions to handle both unpredictable delays and lost action packets sent over networks. Building on this framework, we develop a model-based algorithm, Actor-Critic with Delay Adaptation (ACDA), which dynamically adjusts to delay patterns. Our method significantly outperforms state-of-the-art approaches across a wide range of locomotion benchmark environments.
ML May 21, 2025
Policy Testing in Markov Decision Processes
Kaito Ariu, Po-An Wang, Alexandre Proutiere et al.
We study the policy testing problem in discounted Markov decision processes (MDPs) under the fixed-confidence setting. The goal is to determine whether the value of a given policy exceeds a specified threshold while minimizing the number of observations. We begin by deriving an instance-specific lower bound that any algorithm must satisfy. This lower bound is characterized as the solution to an optimization problem with non-convex constraints. We propose a policy testing algorithm inspired by this optimization problem--a common approach in pure exploration problems such as best-arm identification, where asymptotically optimal algorithms often stem from such optimization-based characterizations. As for other pure exploration tasks in MDPs, however, the non-convex constraints in the lower-bound problem present significant challenges, raising doubts about whether statistically optimal and computationally tractable algorithms can be designed. To address this, we reformulate the lower-bound problem by interchanging the roles of the objective and the constraints, yielding an alternative problem with a non-convex objective but convex constraints. Strikingly, this reformulated problem admits an interpretation as a policy optimization task in a newly constructed reversed MDP. Leveraging recent advances in policy gradient methods, we efficiently solve this problem and use it to design a policy testing algorithm that is statistically optimal--matching the instance-specific lower bound on sample complexity--while remaining computationally tractable. We validate our approach with numerical experiments.
LG Oct 30, 2024
Model-free Low-Rank Reinforcement Learning via Leveraged Entry-wise Matrix Estimation
Stefan Stojanovic, Yassir Jedra, Alexandre Proutiere
We consider the problem of learning an $\varepsilon$-optimal policy in controlled dynamical systems with low-rank latent structure. For this problem, we present LoRa-PI (Low-Rank Policy Iteration), a model-free learning algorithm alternating between policy improvement and policy evaluation steps. In the latter, the algorithm estimates the low-rank matrix corresponding to the (state, action) value function of the current policy using the following two-phase procedure. The entries of the matrix are first sampled uniformly at random to estimate, via a spectral method, the leverage scores of its rows and columns. These scores are then used to extract a few important rows and columns whose entries are further sampled. The algorithm exploits these new samples to complete the matrix estimation using a CUR-like method. For this leveraged matrix estimation procedure, we establish entry-wise guarantees that, remarkably, do not depend on the coherence of the matrix but only on its spikiness. These guarantees imply that LoRa-PI learns an $\varepsilon$-optimal policy using $\widetilde{O}({S+A\over \mathrm{poly}(1-γ)\varepsilon^2})$ samples where $S$ (resp. $A$) denotes the number of states (resp. actions) and $γ$ the discount factor. Our algorithm achieves this order-optimal (in $S$, $A$ and $\varepsilon$) sample complexity under milder conditions than those assumed in previously proposed approaches.
LG Jun 30, 2024
Model-Free Active Exploration in Reinforcement Learning
Alessio Russo, Alexandre Proutiere
We study the problem of exploration in Reinforcement Learning and present a novel model-free solution. We adopt an information-theoretical viewpoint and start from the instance-specific lower bound on the number of samples that have to be collected to identify a nearly-optimal policy. Deriving this lower bound along with the optimal exploration strategy entails solving an intricate optimization problem and requires a model of the system. In turn, most existing sample optimal exploration algorithms rely on estimating the model. We derive an approximation of the instance-specific lower bound that only involves quantities that can be inferred using model-free approaches. Leveraging this approximation, we devise an ensemble-based model-free exploration strategy applicable to both tabular and continuous Markov decision processes. Numerical results demonstrate that our strategy is able to identify efficient policies faster than state-of-the-art exploration approaches.
LG Jan 6, 2022
Learning Optimal Antenna Tilt Control Policies: A Contextual Linear Bandit Approach
Filippo Vannella, Alexandre Proutiere, Yassir Jedra et al.
Controlling antenna tilts in cellular networks is imperative to reach an efficient trade-off between network coverage and capacity. In this paper, we devise algorithms learning optimal tilt control policies from existing data (in the so-called passive learning setting) or from data actively generated by the algorithms (the active learning setting). We formalize the design of such algorithms as a Best Policy Identification (BPI) problem in Contextual Linear Multi-Arm Bandits (CL-MAB). An arm represents an antenna tilt update; the context captures current network conditions; the reward corresponds to an improvement of performance, mixing coverage and capacity; and the objective is to identify, with a given level of confidence, an approximately optimal policy (a function mapping the context to an arm with maximal reward). For CL-MAB in both active and passive learning settings, we derive information-theoretical lower bounds on the number of samples required by any algorithm returning an approximately optimal policy with a given level of certainty, and devise algorithms achieving these fundamental limits. We apply our algorithms to the Remote Electrical Tilt (RET) optimization problem in cellular networks, and show that they can produce an optimal tilt update policy using far fewer data samples than naive or existing rule-based learning algorithms.
LGSep 29, 2021
Minimal Expected Regret in Linear Quadratic ControlYassir Jedra, Alexandre Proutiere
We consider the problem of online learning in Linear Quadratic Control systems whose state transition and state-action transition matrices $A$ and $B$ may be initially unknown. We devise an online learning algorithm and provide guarantees on its expected regret. This regret at time $T$ is upper bounded (i) by $\widetilde{O}((d_u+d_x)\sqrt{d_xT})$ when $A$ and $B$ are unknown, (ii) by $\widetilde{O}(d_x^2\log(T))$ if only $A$ is unknown, and (iii) by $\widetilde{O}(d_x(d_u+d_x)\log(T))$ if only $B$ is unknown and under some mild non-degeneracy condition ($d_x$ and $d_u$ denote the dimensions of the state and of the control input, respectively). These regret scalings are minimal in $T$, $d_x$ and $d_u$ as they match existing lower bounds in scenario (i) when $d_x\le d_u$ [SF20], and in scenario (ii) [lai1986]. We conjecture that our upper bounds are also optimal in scenario (iii) (there is no known lower bound in this setting). Existing online algorithms proceed in epochs of (typically exponentially) growing durations. The control policy is fixed within each epoch, which considerably simplifies the analysis of the estimation error on $A$ and $B$ and hence of the regret. Our algorithm departs from this design choice: it is a simple variant of certainty-equivalence regulators, where the estimates of $A$ and $B$ and the resulting control policy can be updated as frequently as we wish, possibly at every step. Quantifying the impact of such a constantly-varying control policy on the performance of these estimates and on the regret constitutes one of the technical challenges tackled in this paper.
SYSep 15, 2021
Balancing detectability and performance of attacks on the control channel of Markov Decision ProcessesAlessio Russo, Alexandre Proutiere
We investigate the problem of designing optimal stealthy poisoning attacks on the control channel of Markov decision processes (MDPs). This research is motivated by the recent interest of the research community in adversarial and poisoning attacks applied to MDPs and reinforcement learning (RL) methods. The policies resulting from these methods have been shown to be vulnerable to attacks perturbing the observations of the decision-maker. In such an attack, drawing inspiration from adversarial examples used in supervised learning, the amplitude of the adversarial perturbation is limited according to some norm, with the hope that this constraint will make the attack imperceptible. However, such constraints do not grant any level of undetectability and do not take into account the dynamic nature of the underlying Markov process. In this paper, we propose a new attack formulation, based on information-theoretical quantities, that considers the objective of minimizing the detectability of the attack as well as the performance of the controlled process. We analyze the trade-off between the efficiency of the attack and its detectability. We conclude with examples and numerical simulations illustrating this trade-off.
LGSep 13, 2021
Online Learning of Optimally Diverse RankingsStefan Magureanu, Alexandre Proutiere, Marcus Isaksson et al.
Search engines answer users' queries by listing relevant items (e.g. documents, songs, products, web pages, ...). These engines rely on algorithms that learn to rank items so as to present an ordered list maximizing the probability that it contains a relevant item. The main challenge in the design of learning-to-rank algorithms stems from the fact that queries often have different meanings for different users. In the absence of any contextual information about the query, one often has to adhere to the {\it diversity} principle, i.e., to return a list covering the various possible topics or meanings of the query. To formalize this learning-to-rank problem, we propose a natural model where (i) items are categorized into topics, (ii) users find items relevant only if they match the topic of their query, and (iii) the engine is not aware of the topic of an arriving query, nor of the frequency at which queries related to various topics arrive, nor of the topic-dependent click-through-rates of the items. For this problem, we devise LDR (Learning Diverse Rankings), an algorithm that efficiently learns the optimal list based on users' feedback only. We show that after $T$ queries, the regret of LDR scales as $O((N-L)\log(T))$ where $N$ is the number of all items. We further establish that this scaling cannot be improved, i.e., LDR is order optimal. Finally, using numerical experiments on both artificial and real-world data, we illustrate the superiority of LDR compared to existing learning-to-rank algorithms.
LGJun 27, 2021
Regret Analysis in Deterministic Reinforcement LearningDamianos Tranos, Alexandre Proutiere
We consider Markov Decision Processes (MDPs) with deterministic transitions and study the problem of regret minimization, which is central to the analysis and design of optimal learning algorithms. We present logarithmic problem-specific regret lower bounds that explicitly depend on the system parameter (in contrast to previous minimax approaches) and thus, truly quantify the fundamental limit of performance achievable by any learning algorithm. Deterministic MDPs can be interpreted as graphs and analyzed in terms of their cycles, a fact which we leverage in order to identify a class of deterministic MDPs whose regret lower bound can be determined numerically. We further exemplify this result on a deterministic line search problem, and a deterministic MDP with state-dependent rewards, whose regret lower bounds we can state explicitly. These bounds share similarities with the known problem-specific bound of the multi-armed bandit problem and suggest that navigation on a deterministic MDP need not have an effect on the performance of a learning algorithm.
MLJun 5, 2021
Navigating to the Best Policy in Markov Decision ProcessesAymen Al Marjani, Aurélien Garivier, Alexandre Proutiere
We investigate the classical active pure exploration problem in Markov Decision Processes, where the agent sequentially selects actions and, from the resulting system trajectory, aims at identifying the best policy as fast as possible. We propose a problem-dependent lower bound on the average number of steps required before a correct answer can be given with probability at least $1-δ$. We further provide the first algorithm with an instance-specific sample complexity in this setting. This algorithm addresses the general case of communicating MDPs; we also propose a variant with a reduced exploration rate (and hence faster convergence) under an additional ergodicity assumption. This work extends previous results relative to the \emph{generative setting}~\cite{pmlr-v139-marjani21a}, where the agent could at each step query the random outcome of any (state, action) pair. In contrast, we show here how to deal with the \emph{navigation constraints}, induced by the \emph{online setting}. Our analysis relies on an ergodic theorem for non-homogeneous Markov chains which we consider of wide interest in the analysis of Markov Decision Processes.
MLSep 28, 2020
Adaptive Sampling for Best Policy Identification in Markov Decision ProcessesAymen Al Marjani, Alexandre Proutiere
We investigate the problem of best-policy identification in discounted Markov Decision Processes (MDPs) when the learner has access to a generative model. The objective is to devise a learning algorithm returning the best policy as early as possible. We first derive a problem-specific lower bound of the sample complexity satisfied by any learning algorithm. This lower bound corresponds to an optimal sample allocation that solves a non-convex program, and hence, is hard to exploit in the design of efficient algorithms. We then provide a simple and tight upper bound of the sample complexity lower bound, whose corresponding nearly-optimal sample allocation becomes explicit. The upper bound depends on specific functionals of the MDP such as the sub-optimality gaps and the variance of the next-state value function, and thus really captures the hardness of the MDP. Finally, we devise KLB-TS (KL Ball Track-and-Stop), an algorithm tracking this nearly-optimal allocation, and provide asymptotic guarantees for its sample complexity (both almost surely and in expectation). The advantages of KLB-TS against state-of-the-art algorithms are discussed and illustrated numerically.
MLJun 29, 2020
Optimal Best-arm Identification in Linear BanditsYassir Jedra, Alexandre Proutiere
We study the problem of best-arm identification with fixed confidence in stochastic linear bandits. The objective is to identify the best arm with a given level of certainty while minimizing the sampling budget. We devise a simple algorithm whose sampling complexity matches known instance-specific lower bounds, asymptotically almost surely and in expectation. The algorithm relies on an arm sampling rule that tracks an optimal proportion of arm draws, and that remarkably can be updated as rarely as we wish, without compromising its theoretical guarantees. Moreover, unlike existing best-arm identification strategies, our algorithm uses a stopping rule that does not depend on the number of arms. Experimental results suggest that our algorithm significantly outperforms existing algorithms. The paper further provides a first analysis of the best-arm identification problem in linear bandits with a continuous set of arms.
LGMay 21, 2020
Off-policy Learning for Remote Electrical Tilt OptimizationFilippo Vannella, Jaeseong Jeong, Alexandre Proutiere
We address the problem of Remote Electrical Tilt (RET) optimization using off-policy Contextual Multi-Armed-Bandit (CMAB) techniques. The goal in RET optimization is to control the orientation of the vertical tilt angle of the antenna to optimize Key Performance Indicators (KPIs) representing the Quality of Service (QoS) perceived by the users in cellular networks. Learning an improved tilt update policy is hard. On the one hand, coming up with a new policy in an online manner in a real network requires exploring tilt updates that have never been used before, and is operationally too risky. On the other hand, devising this policy via simulations suffers from the simulation-to-reality gap. In this paper, we circumvent these issues by learning an improved policy in an offline manner using existing data collected on real networks. We formulate the problem of devising such a policy using the off-policy CMAB framework. We propose CMAB learning algorithms to extract optimal tilt update policies from the data. We train and evaluate these policies on real-world 4G Long Term Evolution (LTE) cellular network data. Our policies show consistent improvements over the rule-based logging policy used to collect the data.
LGApr 2, 2020
Predictive BanditsSimon Lindståhl, Alexandre Proutiere, Andreas Johnsson
We introduce and study a new class of stochastic bandit problems, referred to as predictive bandits. In each round, the decision maker first decides whether to gather information about the rewards of particular arms (so that their rewards in this round can be predicted). These measurements are costly, and may be corrupted by noise. The decision maker then selects an arm to be actually played in the round. Predictive bandits find applications in many areas; e.g. they can be applied to channel selection problems in radio communication systems. In this paper, we provide the first theoretical results about predictive bandits, and focus on scenarios where the decision maker is allowed to measure at most one arm per round. We derive asymptotic instance-specific regret lower bounds for these problems, and develop algorithms whose regret matches these fundamental limits. We illustrate the performance of our algorithms through numerical experiments. In particular, we highlight the gains that can be achieved by using reward predictions, and investigate the impact of the noise in the corresponding measurements.
STMar 17, 2020
Finite-time Identification of Stable Linear Systems: Optimality of the Least-Squares EstimatorYassir Jedra, Alexandre Proutiere
We present a new finite-time analysis of the estimation error of the Ordinary Least Squares (OLS) estimator for stable linear time-invariant systems. We characterize the number of observed samples (the length of the observed trajectory) sufficient for the OLS estimator to be $(\varepsilon,δ)$-PAC, i.e., to yield an estimation error less than $\varepsilon$ with probability at least $1-δ$. We show that this number matches existing sample complexity lower bounds [1,2] up to universal multiplicative factors (independent of $(\varepsilon,δ)$ and of the system). This paper hence establishes the optimality of the OLS estimator for stable systems, a result conjectured in [1]. Our analysis of the performance of the OLS estimator is simpler, sharper, and easier to interpret than existing analyses. It relies on new concentration results for the covariates matrix.
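As a rough illustration of the estimator discussed in this abstract, the sketch below simulates a scalar stable system $x_{t+1} = a x_t + w_t$ and computes the OLS estimate of $a$ from a single trajectory. The scalar setting, the Gaussian noise, the parameter values, and the function name `simulate_and_estimate` are assumptions made for illustration only; they are not taken from the paper, which treats general stable linear time-invariant systems.

```python
import random


def simulate_and_estimate(a=0.5, horizon=5000, noise_std=1.0, seed=0):
    """Simulate x_{t+1} = a * x_t + w_t and return the scalar OLS estimate of a.

    Hypothetical helper: in the scalar case the OLS estimate reduces to
    sum_t x_t * x_{t+1} / sum_t x_t ** 2 along the observed trajectory.
    """
    rng = random.Random(seed)
    x = 0.0
    num = 0.0  # running sum of x_t * x_{t+1}
    den = 0.0  # running sum of x_t ** 2
    for _ in range(horizon):
        x_next = a * x + rng.gauss(0.0, noise_std)
        num += x * x_next
        den += x * x
        x = x_next
    return num / den


if __name__ == "__main__":
    a_hat = simulate_and_estimate()
    print("estimation error:", abs(a_hat - 0.5))
```

Increasing `horizon` should shrink the printed error, which is the quantity whose dependence on the trajectory length the abstract characterizes.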
LGDec 20, 2019
Distributed Online Optimization with Long-Term ConstraintsDeming Yuan, Alexandre Proutiere, Guodong Shi
We consider distributed online convex optimization problems, where the distributed system consists of various computing units connected through a time-varying communication graph. In each time step, each computing unit selects a constrained vector, experiences a loss equal to an arbitrary convex function evaluated at this vector, and may communicate to its neighbors in the graph. The objective is to minimize the system-wide loss accumulated over time. We propose a decentralized algorithm with regret and cumulative constraint violation in $\mathcal{O}(T^{\max\{c,1-c\} })$ and $\mathcal{O}(T^{1-c/2})$, respectively, for any $c\in (0,1)$, where $T$ is the time horizon. When the loss functions are strongly convex, we establish improved regret and constraint violation upper bounds in $\mathcal{O}(\log(T))$ and $\mathcal{O}(\sqrt{T\log(T)})$. These regret scalings match those obtained by state-of-the-art algorithms and fundamental limits in the corresponding centralized online optimization problem (for both convex and strongly convex loss functions). In the case of bandit feedback, the proposed algorithms achieve a regret and constraint violation in $\mathcal{O}(T^{\max\{c,1-c/3 \} })$ and $\mathcal{O}(T^{1-c/2})$ for any $c\in (0,1)$. We numerically illustrate the performance of our algorithms for the particular case of distributed online regularized linear regression problems.
MLOct 14, 2019
Optimal Clustering from Noisy Binary FeedbackKaito Ariu, Jungseul Ok, Alexandre Proutiere et al.
We study the problem of clustering a set of items from binary user feedback. Such a problem arises in crowdsourcing platforms solving large-scale labeling tasks with minimal effort required from the users. For example, in some of the recent reCAPTCHA systems, users' clicks (binary answers) can be used to efficiently label images. In our inference problem, items are grouped into initially unknown non-overlapping clusters. To recover these clusters, the learner sequentially presents to users a finite list of items together with a question with a binary answer selected from a fixed finite set. For each of these items, the user provides a noisy answer whose expectation is determined by the item cluster and the question and by an item-specific parameter characterizing the {\it hardness} of classifying the item. The objective is to devise an algorithm with a minimal cluster recovery error rate. We derive problem-specific information-theoretical lower bounds on the error rate satisfied by any algorithm, for both uniform and adaptive (list, question) selection strategies. For uniform selection, we present a simple algorithm built upon the K-means algorithm and whose performance almost matches the fundamental limits. For adaptive selection, we develop an adaptive algorithm that is inspired by the derivation of the information-theoretical error lower bounds, and in turn allocates the budget in an efficient way. The algorithm learns to select items hard to cluster and relevant questions more often. We compare the performance of our algorithms with or without the adaptive selection strategy numerically and illustrate the gain achieved by being adaptive.
LGSep 28, 2019
An Optimal Algorithm for Multiplayer Multi-Armed BanditsAlexandre Proutiere, Po-An Wang
The paper addresses the Multiplayer Multi-Armed Bandit (MMAB) problem, where $M$ decision makers or players collaborate to maximize their cumulative reward. When several players select the same arm, a collision occurs and no reward is collected on this arm. Players involved in a collision are informed about this collision. We present DPE (Decentralized Parsimonious Exploration), a decentralized algorithm that achieves the same regret as that obtained by an optimal centralized algorithm. Our algorithm has better regret guarantees than the state-of-the-art algorithm SIC-MMAB \cite{boursier2019}. As in SIC-MMAB, players communicate through collisions only. An additional important advantage of DPE is that it requires very little communication. Specifically, the expected number of rounds where players use collisions to communicate is finite.
LGJul 31, 2019
Optimal Attacks on Reinforcement Learning PoliciesAlessio Russo, Alexandre Proutiere
Control policies, trained using Deep Reinforcement Learning, have been recently shown to be vulnerable to adversarial attacks introducing even very small perturbations to the policy input. The attacks proposed so far have been designed using heuristics, and build on existing adversarial example crafting techniques used to dupe classifiers in supervised learning. In contrast, this paper investigates the problem of devising optimal attacks, depending on a well-defined attacker's objective, e.g., to minimize the main agent's average reward. When the policy and the system dynamics, as well as rewards, are known to the attacker, a scenario referred to as a white-box attack, designing optimal attacks amounts to solving a Markov Decision Process. For what we call black-box attacks, where neither the policy nor the system is known, optimal attacks can be trained using Reinforcement Learning techniques. Through numerical experiments, we demonstrate the efficiency of our attacks compared to existing attacks (usually based on Gradient methods). We further quantify the potential impact of attacks and establish its connection to the smoothness of the policy under attack. Smooth policies are naturally less prone to attacks (this explains why Lipschitz policies, with respect to the state, are more resilient). Finally, we show that from the main agent's perspective, the system uncertainties and the attacker can be modeled as a Partially Observable Markov Decision Process. We actually demonstrate that using Reinforcement Learning techniques tailored to POMDP (e.g. using Recurrent Neural Networks) leads to more resilient policies.
OCJun 27, 2019
From self-tuning regulators to reinforcement learning and back againNikolai Matni, Alexandre Proutiere, Anders Rantzer et al.
Machine and reinforcement learning (RL) are increasingly being applied to plan and control the behavior of autonomous systems interacting with the physical world. Examples include self-driving vehicles, distributed sensor networks, and agile robots. However, when machine learning is to be applied in these new settings, the algorithms had better come with the same type of reliability, robustness, and safety bounds that are hallmarks of control theory, or failures could be catastrophic. Thus, as learning algorithms are increasingly and more aggressively deployed in safety critical settings, it is imperative that control theorists join the conversation. The goal of this tutorial paper is to provide a starting point for control theorists wishing to work on learning related problems, by covering recent advances bridging learning and control theory, and by placing these results within an appropriate historical context of system identification and adaptive control.
SYMar 25, 2019
Sample Complexity Lower Bounds for Linear System IdentificationYassir Jedra, Alexandre Proutiere
This paper establishes problem-specific sample complexity lower bounds for linear system identification problems. The sample complexity is defined in the PAC framework: it corresponds to the time it takes to identify the system parameters with prescribed accuracy and confidence levels. By problem-specific, we mean that the lower bound explicitly depends on the system to be identified (which contrasts with minimax lower bounds), and hence really captures the identification hardness specific to the system. We consider both uncontrolled and controlled systems. For uncontrolled systems, the lower bounds are valid for any linear system, stable or not, and only depend on the system finite-time controllability gramian. A simplified lower bound depending on the spectrum of the system only is also derived. In view of recent finite-time analysis of classical estimation methods (e.g. ordinary least squares), our sample complexity lower bounds are tight for many systems. For controlled systems, our lower bounds are not as explicit as in the case of uncontrolled systems, but could well provide interesting insights into the design of control policy with minimal sample complexity.
LGFeb 13, 2019
Distributed Online Linear RegressionDeming Yuan, Alexandre Proutiere, Guodong Shi
We study online linear regression problems in a distributed setting, where the data is spread over a network. In each round, each network node proposes a linear predictor, with the objective of fitting the \emph{network-wide} data. It then updates its predictor for the next round according to the received local feedback and information received from neighboring nodes. The predictions made at a given node are assessed through the notion of regret, defined as the difference between their cumulative network-wide square errors and those of the best off-line network-wide linear predictor. Various scenarios are investigated, depending on the nature of the local feedback (full information or bandit feedback), on the set of available predictors (the decision set), and the way data is generated (by an oblivious or adaptive adversary). We propose simple and natural distributed regression algorithms, involving, at each node and in each round, a local gradient descent step and a communication and averaging step where nodes aim at aligning their predictors to those of their neighbors. We establish regret upper bounds typically in ${\cal O}(T^{3/4})$ when the decision set is unbounded and in ${\cal O}(\sqrt{T})$ in case of bounded decision set.
CVJul 2, 2018
Learning to Personalize in Appearance-Based Gaze TrackingErik Lindén, Jonas Sjöstrand, Alexandre Proutiere
Personal variations severely limit the performance of appearance-based gaze tracking. Adapting to these variations using standard neural network model adaptation methods is difficult. The problems range from overfitting, due to small amounts of training data, to underfitting, due to restrictive model architectures. We tackle these problems by introducing the SPatial Adaptive GaZe Estimator (SPAZE). By modeling personal variations as a low-dimensional latent parameter space, SPAZE provides just enough adaptability to capture the range of personal variations without being prone to overfitting. Calibrating SPAZE for a new person reduces to solving a small optimization problem. SPAZE achieves an error of 2.70 degrees with 9 calibration samples on MPIIGaze, improving on the state-of-the-art by 14 %. We contribute to gaze tracking research by empirically showing that personal variations are well-modeled as a 3-dimensional latent parameter space for each eye. We show that this low-dimensionality is expected by examining model-based approaches to gaze tracking. We also show that accurate head pose-free gaze tracking is possible.
LGJun 3, 2018
Exploration in Structured Reinforcement LearningJungseul Ok, Alexandre Proutiere, Damianos Tranos
We address reinforcement learning problems with finite state and action spaces where the underlying MDP has some known structure that could be potentially exploited to minimize the exploration rates of suboptimal (state, action) pairs. For any arbitrary structure, we derive problem-specific regret lower bounds satisfied by any learning algorithm. These lower bounds are made explicit for unstructured MDPs and for those whose transition probabilities and average reward functions are Lipschitz continuous w.r.t. the state and action. For Lipschitz MDPs, the bounds are shown not to scale with the sizes $S$ and $A$ of the state and action spaces, i.e., they are smaller than $c\log T$ where $T$ is the time horizon and the constant $c$ only depends on the Lipschitz structure, the span of the bias function, and the minimal action sub-optimality gap. This contrasts with unstructured MDPs where the regret lower bound typically scales as $SA\log T$. We devise DEL (Directed Exploration Learning), an algorithm that matches our regret lower bounds. We further simplify the algorithm for Lipschitz MDPs, and show that the simplified version is still able to efficiently exploit the structure.
MLNov 1, 2017
Minimal Exploration in Structured Stochastic BanditsRichard Combes, Stefan Magureanu, Alexandre Proutiere
This paper introduces and addresses a wide class of stochastic bandit problems where the function mapping the arm to the corresponding reward exhibits some known structural properties. Most existing structures (e.g. linear, Lipschitz, unimodal, combinatorial, dueling, ...) are covered by our framework. We derive an asymptotic instance-specific regret lower bound for these problems, and develop OSSB, an algorithm whose regret matches this fundamental limit. OSSB is not based on the classical principle of "optimism in the face of uncertainty" or on Thompson sampling, and rather aims at matching the minimal exploration rates of sub-optimal arms as characterized in the derivation of the regret lower bound. We illustrate the efficiency of OSSB using numerical experiments in the case of the linear bandit problem and show that OSSB outperforms existing algorithms, including Thompson sampling.
PROct 20, 2015
Optimal Cluster Recovery in the Labeled Stochastic Block ModelSe-Young Yun, Alexandre Proutiere
We consider the problem of community detection or clustering in the labeled Stochastic Block Model (LSBM) with a finite number $K$ of clusters of sizes linearly growing with the global population of items $n$. Every pair of items is labeled independently at random, and label $\ell$ appears with probability $p(i,j,\ell)$ between two items in clusters indexed by $i$ and $j$, respectively. The objective is to reconstruct the clusters from the observation of these random labels. Clustering under the SBM and its extensions has attracted much attention recently. Most existing work aimed at characterizing the set of parameters such that it is possible to infer clusters either positively correlated with the true clusters, or with a vanishing proportion of misclassified items, or exactly matching the true clusters. We find the set of parameters such that there exists a clustering algorithm with at most $s$ misclassified items on average under the general LSBM and for any $s=o(n)$, which solves one open problem raised in \cite{abbe2015community}. We further develop an algorithm, based on simple spectral methods, that achieves this fundamental performance limit within $O(n \mbox{polylog}(n))$ computations and without the a-priori knowledge of the model parameters.
LGJul 12, 2015
Cluster-Aided Mobility PredictionsJaeseong Jeong, Mathieu Leconte, Alexandre Proutiere
Predicting the future location of users in wireless networks has numerous applications, and can help service providers to improve the quality of service perceived by their clients. The location predictors proposed so far estimate the next location of a specific user by inspecting the past individual trajectories of this user. As a consequence, when the training data collected for a given user is limited, the resulting prediction is inaccurate. In this paper, we develop cluster-aided predictors that exploit past trajectories collected from all users to predict the next location of a given user. These predictors rely on clustering techniques and extract from the training data similarities among the mobility patterns of the various users to improve the prediction accuracy. Specifically, we present CAMP (Cluster-Aided Mobility Predictor), a cluster-aided predictor whose design is based on recent non-parametric Bayesian statistical tools. CAMP is robust and adaptive in the sense that it exploits similarities in users' mobility only if such similarities are really present in the training data. We analytically prove the consistency of the predictions provided by CAMP, and investigate its performance using two large-scale datasets. CAMP significantly outperforms existing predictors, and in particular those that only exploit individual past trajectories.
SPApr 13, 2015
Streaming, Memory Limited Matrix Completion with NoiseSe-Young Yun, Marc Lelarge, Alexandre Proutiere
In this paper, we consider the streaming memory-limited matrix completion problem when the observed entries are noisy versions of a small random fraction of the original entries. We are interested in scenarios where the matrix size is very large so the matrix is very hard to store and manipulate. Here, columns of the observed matrix are presented sequentially and the goal is to complete the missing entries after one pass on the data with limited memory space and limited computational complexity. We propose a streaming algorithm which produces an estimate of the original matrix with a vanishing mean square error, uses memory space scaling linearly with the ambient dimension of the matrix, i.e. the memory required to store the output alone, and performs a number of computations on the order of the number of non-zero entries of the input matrix.
LGFeb 11, 2015
Combinatorial Bandits RevisitedRichard Combes, M. Sadegh Talebi, Alexandre Proutiere et al.
This paper investigates stochastic and adversarial combinatorial multi-armed bandit problems. In the stochastic setting under semi-bandit feedback, we derive a problem-specific regret lower bound, and discuss its scaling with the dimension of the decision space. We propose ESCB, an algorithm that efficiently exploits the structure of the problem and provide a finite-time analysis of its regret. ESCB has better performance guarantees than existing algorithms, and significantly outperforms these algorithms in practice. In the adversarial setting under bandit feedback, we propose \textsc{CombEXP}, an algorithm with the same regret scaling as state-of-the-art algorithms, but with lower computational complexity for some combinatorial problems.
LGJun 28, 2014
Unimodal Bandits without SmoothnessRichard Combes, Alexandre Proutiere
We consider stochastic bandit problems with a continuous set of arms and where the expected reward is a continuous and unimodal function of the arm. No further assumption is made regarding the smoothness and the structure of the expected reward function. For these problems, we propose the Stochastic Pentachotomy (SP) algorithm, and derive finite-time upper bounds on its regret and optimization error. In particular, we show that, for any expected reward function $μ$ that behaves as $μ(x)=μ(x^\star)-C|x-x^\star|^ξ$ locally around its maximizer $x^\star$ for some $ξ, C>0$, the SP algorithm is order-optimal. Namely its regret and optimization error scale as $O(\sqrt{T\log(T)})$ and $O(\sqrt{\log(T)/T})$, respectively, when the time horizon $T$ grows large. These scalings are achieved without the knowledge of $ξ$ and $C$. Our algorithm is based on asymptotically optimal sequential statistical tests used to successively trim an interval that contains the best arm with high probability. To our knowledge, the SP algorithm constitutes the first sequential arm selection rule that achieves a regret and optimization error scaling as $O(\sqrt{T})$ and $O(1/\sqrt{T})$, respectively, up to a logarithmic factor for non-smooth expected reward functions, as well as for smooth functions with unknown smoothness.
LGMay 20, 2014
Unimodal Bandits: Regret Lower Bounds and Optimal AlgorithmsRichard Combes, Alexandre Proutiere
We consider stochastic multi-armed bandits where the expected reward is a unimodal function over partially ordered arms. This important class of problems has been recently investigated in (Cope 2009, Yu 2011). The set of arms is either discrete, in which case arms correspond to the vertices of a finite graph whose structure represents similarity in rewards, or continuous, in which case arms belong to a bounded interval. For discrete unimodal bandits, we derive asymptotic lower bounds for the regret achieved under any algorithm, and propose OSUB, an algorithm whose regret matches this lower bound. Our algorithm optimally exploits the unimodal structure of the problem, and surprisingly, its asymptotic regret does not depend on the number of arms. We also provide a regret upper bound for OSUB in non-stationary environments where the expected rewards smoothly evolve over time. The analytical results are supported by numerical experiments showing that OSUB performs significantly better than the state-of-the-art algorithms. For continuous sets of arms, we provide a brief discussion. We show that combining an appropriate discretization of the set of arms with the UCB algorithm yields an order-optimal regret, and in practice, outperforms recently proposed algorithms designed to exploit the unimodal structure.
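The last idea in this abstract, combining a discretization of a continuous arm set with the UCB algorithm, can be sketched as follows. This is only a minimal illustration under stated assumptions: the uniform grid with a fixed number of arms, the Bernoulli rewards, the standard UCB1 index, the test function `mu`, and the helper name `ucb_on_grid` are all hypothetical choices, and the paper's "appropriate discretization" may well differ.

```python
import math
import random


def ucb_on_grid(mu, num_arms=5, horizon=20000, seed=0):
    """Run UCB1 on a uniform grid of [0, 1] with Bernoulli(mu(x)) rewards.

    Hypothetical helper: returns the grid points and the pull count of each.
    """
    rng = random.Random(seed)
    # Midpoints of num_arms equal-width cells covering [0, 1].
    grid = [(k + 0.5) / num_arms for k in range(num_arms)]
    counts = [0] * num_arms
    sums = [0.0] * num_arms
    for t in range(1, horizon + 1):
        if t <= num_arms:
            k = t - 1  # initialization: pull every arm once
        else:
            # UCB1 index: empirical mean plus exploration bonus.
            k = max(
                range(num_arms),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        reward = 1.0 if rng.random() < mu(grid[k]) else 0.0
        counts[k] += 1
        sums[k] += reward
    return grid, counts


def mu(x):
    """Assumed unimodal expected reward on [0, 1], peaking at x = 0.3."""
    return 0.9 - abs(x - 0.3)


if __name__ == "__main__":
    grid, counts = ucb_on_grid(mu)
    best = counts.index(max(counts))
    print("most pulled grid point:", grid[best])
```

Note that this sketch does not exploit the unimodal structure at all; it only treats the grid points as independent arms, which is the baseline the abstract compares against structure-aware algorithms.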