Cem Tekin

LG
h-index: 7
40 papers
611 citations
Novelty: 56%
AI Score: 44

40 Papers

LG May 9, 2022
Federated Multi-Armed Bandits Under Byzantine Attacks

Artun Saday, İlker Demirel, Yiğit Yıldırım et al.

Multi-armed bandits (MAB) is a sequential decision-making model in which the learner controls the trade-off between exploration and exploitation to maximize its cumulative reward. Federated multi-armed bandits (FMAB) is an emerging framework where a cohort of learners with heterogeneous local models play an MAB game and communicate their aggregated feedback to a server to learn a globally optimal arm. Two key hurdles in FMAB are communication-efficient learning and resilience to adversarial attacks. To address these issues, we study the FMAB problem in the presence of Byzantine clients who can send false model updates threatening the learning process. We analyze the sample complexity and the regret of $β$-optimal arm identification. We borrow tools from robust statistics and propose a median-of-means (MoM)-based online algorithm, Fed-MoM-UCB, to cope with Byzantine clients. In particular, we show that if the Byzantine clients constitute less than half of the cohort, the cumulative regret with respect to $β$-optimal arms is bounded over time with high probability, showcasing both communication efficiency and Byzantine resilience. We analyze the interplay between the algorithm parameters, a discernibility margin, regret, communication cost, and the arms' suboptimality gaps. We demonstrate Fed-MoM-UCB's effectiveness against the baselines in the presence of Byzantine attacks via experiments.
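The median-of-means idea named in the abstract can be illustrated with a minimal sketch. This is not Fed-MoM-UCB itself; the function name `median_of_means`, the grouping rule, and the sample values are assumptions chosen only to show why a median of group averages tolerates a minority of corrupted reports.

```python
import statistics


def median_of_means(samples, num_groups):
    """Average each of num_groups contiguous groups, then take the median of the averages.

    Samples left over when len(samples) is not divisible by num_groups are ignored.
    """
    if num_groups < 1 or num_groups > len(samples):
        raise ValueError("num_groups must be between 1 and len(samples)")
    group_size = len(samples) // num_groups
    group_means = []
    for g in range(num_groups):
        group = samples[g * group_size:(g + 1) * group_size]
        group_means.append(sum(group) / len(group))
    return statistics.median(group_means)


# Hypothetical reports: five honest clients near 0.5, two Byzantine clients sending 100.
reports = [0.48, 0.52, 0.50, 0.49, 0.51, 100.0, 100.0]
robust_estimate = median_of_means(reports, num_groups=7)
naive_estimate = sum(reports) / len(reports)
print(robust_estimate, naive_estimate)
```

Because fewer than half of the groups contain a corrupted value, the median stays close to the honest reports, while the plain average is pulled far away.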

LG Aug 16, 2023
Robust Bayesian Satisficing

Artun Saday, Yaşar Cahit Yıldırım, Cem Tekin

Distributional shifts pose a significant challenge to achieving robustness in contemporary machine learning. To overcome this challenge, robust satisficing (RS) seeks a robust solution to an unspecified distributional shift while achieving a utility above a desired threshold. This paper focuses on the problem of RS in contextual Bayesian optimization when there is a discrepancy between the true and reference distributions of the context. We propose a novel robust Bayesian satisficing algorithm called RoBOS for noisy black-box optimization. Our algorithm guarantees sublinear lenient regret under certain assumptions on the amount of distribution shift. In addition, we define a weaker notion of regret called robust satisficing regret, in which our algorithm achieves a sublinear upper bound independent of the amount of distribution shift. To demonstrate the effectiveness of our method, we apply it to various learning problems and compare it to other approaches, such as distributionally robust optimization.

5.5 LG Apr 16
Multi-User mmWave Beam and Rate Adaptation via Combinatorial Satisficing Bandits

Emre Özyıldırım, Barış Yaycı, Umut Eren Akturk et al.

We study downlink beam and rate adaptation in a multi-user mmWave MISO system where multiple base stations (BSs), each using analog beamforming from finite codebooks, serve multiple single-antenna user equipments (UEs) with a unique beam per UE and discrete data transmission rates. BSs learn about transmission success based on ACK/NACK feedback. To encode service goals, we introduce a satisficing throughput threshold $τ_r$ and cast joint beam and rate adaptation as a combinatorial semi-bandit over beam-rate tuples. Within this framework, we propose SAT-CTS, a lightweight, threshold-aware policy that blends conservative confidence estimates with posterior sampling, steering learning toward meeting $τ_r$ rather than merely maximizing. Our main theoretical contribution provides the first finite-time regret bounds for combinatorial semi-bandits with satisficing objective: when $τ_r$ is realizable, we upper bound the cumulative satisficing regret to the target with a time-independent constant, and when $τ_r$ is non-realizable, we show that SAT-CTS incurs only a finite expected transient outside committed CTS rounds, after which its regret is governed by the sum of the regret contributions of restarted CTS rounds, yielding an $O((\log T)^2)$ standard regret bound. On the practical side, we evaluate the performance via cumulative satisficing regret to $τ_r$ alongside standard regret and fairness. Experiments with time-varying sparse multipath channels show that SAT-CTS consistently reduces satisficing regret and maintains competitive standard regret, while achieving favorable average throughput and fairness across users, indicating that feedback-efficient learning can equitably allocate beams and rates to meet QoS targets without channel state knowledge.

LG Jun 6, 2022
Robust Pareto Set Identification with Contaminated Bandit Feedback

İlter Onat Korkmaz, Efe Eren Ceyani, Kerem Bozgan et al.

We consider the Pareto set identification (PSI) problem in multi-objective multi-armed bandits (MO-MAB) with contaminated reward observations. At each arm pull, with some fixed probability, the true reward samples are replaced with samples from an arbitrary contamination distribution chosen by an adversary. We consider (α, δ)-PAC PSI and propose a sample median-based multi-objective adaptive elimination algorithm that returns an (α, δ)-PAC Pareto set upon termination with a sample complexity bound that depends on the contamination probability. As the contamination probability decreases, we recover the well-known sample complexity results in MO-MAB. We compare the proposed algorithm with a mean-based method from the MO-MAB literature, as well as an extended version that uses median estimators, on several PSI problems under adversarial corruptions, including review bombing and diabetes management. Our numerical results support our theoretical findings and demonstrate that robust algorithm design is crucial for accurate PSI under contaminated reward observations.
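To make the contrast between mean-based and median-based estimates concrete, here is a minimal sketch. It is not the adaptive elimination algorithm from the abstract; the helper names and the reward samples are hypothetical, and the Pareto set is computed by a plain pairwise dominance check on the estimates.

```python
import statistics


def dominates(u, v):
    """True if u is at least as good as v in every objective and strictly better in one."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))


def pareto_set(estimates):
    """Indices of arms whose estimated reward vector is not dominated by any other arm."""
    return [
        i for i, u in enumerate(estimates)
        if not any(dominates(v, u) for j, v in enumerate(estimates) if j != i)
    ]


def coordinatewise(stat, samples):
    """Apply stat (for example mean or median) to each objective separately."""
    return [stat(column) for column in zip(*samples)]


# Hypothetical two-objective samples; the last sample of arm 2 is contaminated.
samples_per_arm = [
    [(0.9, 0.1), (0.8, 0.2), (0.9, 0.1)],
    [(0.1, 0.9), (0.2, 0.8), (0.1, 0.9)],
    [(0.05, 0.05), (0.1, 0.1), (5.0, 5.0)],
]
median_estimates = [coordinatewise(statistics.median, s) for s in samples_per_arm]
mean_estimates = [coordinatewise(statistics.mean, s) for s in samples_per_arm]
print(pareto_set(median_estimates), pareto_set(mean_estimates))
```

With medians, the single outlier does not change arm 2's estimate, so arms 0 and 1 form the Pareto set; with means, the outlier makes arm 2 appear to dominate both.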

LG Dec 9, 2024 Code
VOPy: A Framework for Black-box Vector Optimization

Yaşar Cahit Yıldırım, Efe Mert Karagözlü, İlter Onat Korkmaz et al.

We introduce VOPy, an open-source Python library designed to address black-box vector optimization, where multiple objectives must be optimized simultaneously with respect to a partial order induced by a convex cone. VOPy extends beyond traditional multi-objective optimization (MOO) tools by enabling flexible, cone-based ordering of solutions, with an application scope that includes environments with observation noise, discrete or continuous design spaces, limited budgets, and batch observations. VOPy provides a modular architecture, facilitating the integration of existing methods and the development of novel algorithms. We detail VOPy's architecture, usage, and potential to advance research and application in the field of vector optimization. The source code for VOPy is available at https://github.com/Bilkent-CYBORG/VOPy.

LG May 21, 2025
Cost-aware LLM-based Online Dataset Annotation

Eray Can Elumar, Cem Tekin, Osman Yagan

Recent advances in large language models (LLMs) have enabled automated dataset labeling with minimal human supervision. While majority voting across multiple LLMs can improve label reliability by mitigating individual model biases, it incurs high computational costs due to repeated querying. In this work, we propose a novel online framework, Cost-aware Majority Voting (CaMVo), for efficient and accurate LLM-based dataset annotation. CaMVo adaptively selects a subset of LLMs for each data instance based on contextual embeddings, balancing confidence and cost without requiring pre-training or ground-truth labels. Leveraging a LinUCB-based selection mechanism and a Bayesian estimator over confidence scores, CaMVo estimates a lower bound on labeling accuracy for each LLM and aggregates responses through weighted majority voting. Our empirical evaluation on the MMLU and IMDB Movie Review datasets demonstrates that CaMVo achieves comparable or superior accuracy to full majority voting while significantly reducing labeling costs. This establishes CaMVo as a practical and robust solution for cost-efficient annotation in dynamic labeling environments.
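The aggregation step mentioned in the abstract, weighted majority voting, can be sketched as follows. This is only an illustration of the voting rule: the function name, the labels, and the weights are hypothetical, and the sketch does not reproduce how CaMVo derives its weights or selects which LLMs to query.

```python
from collections import defaultdict


def weighted_majority_vote(labels, weights):
    """Return the label with the largest total weight across voters."""
    if len(labels) != len(weights) or not labels:
        raise ValueError("labels and weights must be non-empty and of equal length")
    totals = defaultdict(float)
    for label, weight in zip(labels, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


# Hypothetical votes from three models on one instance.
votes = ["pos", "neg", "neg"]
# Assumed per-model weights, e.g. derived from an estimated accuracy.
confidence_weights = [0.9, 0.3, 0.4]

weighted_label = weighted_majority_vote(votes, confidence_weights)
unweighted_label = weighted_majority_vote(votes, [1.0, 1.0, 1.0])
print(weighted_label, unweighted_label)
```

In this example one high-weight model outvotes two low-weight models, which is the behaviour that distinguishes weighted from plain majority voting.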

LG Jun 2, 2025
Robust Satisficing Gaussian Process Bandits Under Adversarial Attacks

Artun Saday, Yaşar Cahit Yıldırım, Cem Tekin

We address the problem of Gaussian Process (GP) optimization in the presence of unknown and potentially varying adversarial perturbations. Unlike traditional robust optimization approaches that focus on maximizing performance under worst-case scenarios, we consider a robust satisficing objective, where the goal is to consistently achieve a predefined performance threshold $τ$, even under adversarial conditions. We propose two novel algorithms based on distinct formulations of robust satisficing, and show that they are instances of a general robust satisficing framework. Further, each algorithm offers different guarantees depending on the nature of the adversary. Specifically, we derive two regret bounds: one that is sublinear over time, assuming certain conditions on the adversary and the satisficing threshold $τ$, and another that scales with the perturbation magnitude but requires no assumptions on the adversary. Through extensive experiments, we demonstrate that our approach outperforms the established robust optimization methods in achieving the satisficing objective, particularly when the ambiguity set of the robust optimization framework is inaccurately specified.

LG Jan 30, 2025
Bandits with Anytime Knapsacks

Eray Can Elumar, Cem Tekin, Osman Yagan

We consider bandits with anytime knapsacks (BwAK), a novel version of the BwK problem where there is an \textit{anytime} cost constraint instead of a total cost budget. This problem setting introduces additional complexities as it mandates adherence to the constraint throughout the decision-making process. We propose SUAK, an algorithm that utilizes upper confidence bounds to identify the optimal mixture of arms while maintaining a balance between exploration and exploitation. SUAK is an adaptive algorithm that strategically utilizes the available budget in each round in the decision-making process and skips a round when it is possible to violate the anytime cost constraint. In particular, SUAK slightly under-utilizes the available cost budget to reduce the need for skipping rounds. We show that SUAK attains the same problem-dependent regret upper bound of $ O(K \log T)$ established in prior work under the simpler BwK framework. Finally, we provide simulations to verify the utility of SUAK in practical settings.

LG Dec 3, 2024
Vector Optimization with Gaussian Process Bandits

İlter Onat Korkmaz, Yaşar Cahit Yıldırım, Çağın Ararat et al.

Learning problems in which multiple conflicting objectives must be considered simultaneously often arise in various fields, including engineering, drug design, and environmental management. Traditional methods for dealing with multiple black-box objective functions, such as scalarization and identification of the Pareto set under the componentwise order, have limitations in incorporating objective preferences and exploring the solution space accordingly. While vector optimization offers improved flexibility and adaptability via specifying partial orders based on ordering cones, current techniques designed for sequential experiments either suffer from high sample complexity or lack theoretical guarantees. To address these issues, we propose Vector Optimization with Gaussian Process (VOGP), a probably approximately correct adaptive elimination algorithm that performs black-box vector optimization using Gaussian process bandits. VOGP allows users to convey objective preferences through ordering cones while performing efficient sampling by exploiting the smoothness of the objective function, resulting in a more effective optimization process that requires fewer evaluations. We establish theoretical guarantees for VOGP and derive information gain-based and kernel-specific sample complexity bounds. We also conduct experiments on both real-world and synthetic datasets to compare VOGP with the state-of-the-art methods.

LG Dec 13, 2021
Safe Linear Leveling Bandits

Ilker Demirel, Mehmet Ufuk Ozdemir, Cem Tekin

Multi-armed bandits (MAB) are extensively studied in various settings where the objective is to \textit{maximize} the actions' outcomes (i.e., rewards) over time. Since safety is crucial in many real-world problems, safe versions of MAB algorithms have also garnered considerable interest. In this work, we tackle a different critical task through the lens of \textit{linear stochastic bandits}, where the aim is to keep the actions' outcomes close to a target level while respecting a \textit{two-sided} safety constraint, which we call \textit{leveling}. Such a task is prevalent in numerous domains. Many healthcare problems, for instance, require keeping a physiological variable in a range and preferably close to a target level. The radical change in our objective necessitates a new acquisition strategy, which is at the heart of a MAB algorithm. We propose SALE-LTS: Safe Leveling via Linear Thompson Sampling algorithm, with a novel acquisition strategy to accommodate our task and show that it achieves sublinear regret with the same time and dimension dependence as previous works on the classical reward maximization problem absent any safety constraint. We demonstrate and discuss our algorithm's empirical performance in detail via thorough experiments.

LG Nov 29, 2021
Contextual Combinatorial Multi-output GP Bandits with Group Constraints

Sepehr Elahi, Baran Atalar, Sevda Öğüt et al.

In federated multi-armed bandit problems, maximizing global reward while satisfying minimum privacy requirements to protect clients is the main goal. To formulate such problems, we consider a combinatorial contextual bandit setting with groups and changing action sets, where similar base arms arrive in groups and a set of base arms, called a super arm, must be chosen in each round to maximize super arm reward while satisfying the constraints of the rewards of groups from which base arms were chosen. To allow for greater flexibility, we let each base arm have two outcomes, modeled as the output of a two-output Gaussian process (GP), where one outcome is used to compute super arm reward and the other for group reward. We then propose a novel double-UCB GP-bandit algorithm, called Thresholded Combinatorial Gaussian Process Upper Confidence Bounds (TCGP-UCB), which balances between maximizing cumulative super arm reward and satisfying group reward constraints and can be tuned to prefer one over the other. We also define a new notion of regret that combines super arm regret with group reward constraint satisfaction and prove that TCGP-UCB incurs $\tilde{O}(\sqrt{λ^*(K)KT\overline{γ}_{T}} )$ regret with high probability, where $\overline{γ}_{T}$ is the maximum information gain associated with the set of base arm contexts that appeared in the first $T$ rounds and $K$ is the maximum super arm cardinality over all rounds. Lastly, we show in experiments using synthetic and real-world data, based on a federated learning setup as well as a content-recommendation one, that our algorithm performs better than the current non-GP state-of-the-art combinatorial bandit algorithm while satisfying group constraints.

LG Nov 26, 2021
ESCADA: Efficient Safety and Context Aware Dose Allocation for Precision Medicine

Ilker Demirel, Ahmet Alparslan Celik, Cem Tekin

Finding an optimal individualized treatment regimen is considered one of the most challenging precision medicine problems. Various patient characteristics influence the response to the treatment, and hence, there is no one-size-fits-all regimen. Moreover, the administration of an unsafe dose during the treatment can have adverse effects on health. Therefore, a treatment model must ensure patient \emph{safety} while \emph{efficiently} optimizing the course of therapy. We study a prevalent medical problem where the treatment aims to keep a physiological variable in a safe range and preferably close to a target level, which we refer to as \emph{leveling}. Such a task may be relevant in numerous other domains as well. We propose ESCADA, a novel and generic multi-armed bandit (MAB) algorithm tailored for the leveling task, to make safe, personalized, and context-aware dose recommendations. We derive high probability upper bounds on its cumulative regret and safety guarantees. Following ESCADA's design, we also describe its Thompson sampling-based counterpart. We discuss why straightforward adaptations of classical MAB algorithms such as GP-UCB may not be a good fit for the leveling task. Finally, we conduct \emph{in silico} experiments on the bolus-insulin dose allocation problem in type 1 diabetes mellitus and compare our algorithms against the well-known GP-UCB algorithm, rule-based dose calculators, and a clinician.

LG Oct 23, 2021
Vector Optimization with Stochastic Bandit Feedback

Çağın Ararat, Cem Tekin

We introduce vector optimization problems with stochastic bandit feedback, in which preferences among designs are encoded by a polyhedral ordering cone $C$. Our setup generalizes the best arm identification problem to vector-valued rewards by extending the concept of Pareto set beyond multi-objective optimization. We characterize the sample complexity of ($ε,δ$)-PAC Pareto set identification by defining a new cone-dependent notion of complexity, called the ordering complexity. In particular, we provide gap-dependent and worst-case lower bounds on the sample complexity and show that, in the worst-case, the sample complexity scales with the square of ordering complexity. Furthermore, we investigate the sample complexity of the naïve elimination algorithm and prove that it nearly matches the worst-case sample complexity. Finally, we run experiments to verify our theoretical results and illustrate how $C$ and sampling budget affect the Pareto set, the returned ($ε,δ$)-PAC Pareto set, and the success of identification.
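The role of the polyhedral ordering cone $C$ can be illustrated with a small sketch. It assumes the cone is given in the form $C = \{x : Wx \ge 0\}$ and that design $v$ dominates design $u$ when $v - u$ is a nonzero element of $C$; the matrix `W`, the helper names, and the mean vectors are hypothetical, and the sketch works on known means rather than bandit feedback.

```python
def in_cone(W, x, tol=1e-12):
    """True if W @ x >= 0 holds row by row, i.e. x lies in the cone {x : Wx >= 0}."""
    return all(sum(w * xi for w, xi in zip(row, x)) >= -tol for row in W)


def cone_dominates(W, v, u, tol=1e-12):
    """True if v - u is a nonzero vector in the cone, so v is preferred to u."""
    diff = [a - b for a, b in zip(v, u)]
    return any(abs(d) > tol for d in diff) and in_cone(W, diff, tol)


def cone_pareto_set(W, means):
    """Indices of designs that no other design dominates under the cone order."""
    return [
        i for i, u in enumerate(means)
        if not any(cone_dominates(W, v, u) for j, v in enumerate(means) if j != i)
    ]


means = [(10, 0), (0, 10), (6, 6)]
orthant = [[1, 0], [0, 1]]        # componentwise order of multi-objective optimization
wider_cone = [[10, 9], [9, 10]]   # assumed cone that strictly contains the orthant

print(cone_pareto_set(orthant, means), cone_pareto_set(wider_cone, means))
```

Under the componentwise order all three designs are Pareto optimal, while the wider cone lets the balanced design (6, 6) dominate the two extreme ones, which shows how the choice of $C$ changes the Pareto set.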

LG Oct 5, 2021
Contextual Combinatorial Bandits with Changing Action Sets via Gaussian Processes

Andi Nika, Sepehr Elahi, Cem Tekin

We consider a contextual bandit problem with a combinatorial action set and time-varying base arm availability. At the beginning of each round, the agent observes the set of available base arms and their contexts and then selects an action that is a feasible subset of the set of available base arms to maximize its cumulative reward in the long run. We assume that the mean outcomes of base arms are samples from a Gaussian Process (GP) indexed by the context set ${\cal X}$, and the expected reward is Lipschitz continuous in expected base arm outcomes. For this setup, we propose an algorithm called Optimistic Combinatorial Learning and Optimization with Kernel Upper Confidence Bounds (O'CLOK-UCB) and prove that it incurs $\tilde{O}(\sqrt{λ^*(K)KTγ_{KT}(\cup_{t\leq T}\mathcal{X}_t)} )$ regret with high probability, where $γ_{KT}(\cup_{t\leq T}\mathcal{X}_t)$ is the maximum information gain associated with the sets of base arm contexts $\mathcal{X}_t$ that appeared in the first $T$ rounds, $K$ is the maximum cardinality of any feasible action over all rounds, and $λ^*(K)$ is the maximum eigenvalue of all covariance matrices of selected actions up to time $T$, which is a function of $K$. To dramatically speed up the algorithm, we also propose a variant of O'CLOK-UCB that uses sparse GPs. Finally, we experimentally show that both algorithms exploit inter-base arm outcome correlation and vastly outperform the previous state-of-the-art UCB-based algorithms in realistic setups.

LG Sep 8, 2021
Conservative Policy Construction Using Variational Autoencoders for Logged Data with Missing Values

Mahed Abroshan, Kai Hou Yip, Cem Tekin et al.

In high-stakes applications of data-driven decision making like healthcare, it is of paramount importance to learn a policy that maximizes the reward while avoiding potentially dangerous actions when there is uncertainty. There are two main challenges usually associated with this problem. Firstly, learning through online exploration is not possible due to the critical nature of such applications. Therefore, we need to resort to observational datasets with no counterfactuals. Secondly, such datasets are usually imperfect, additionally cursed with missing values in the attributes of features. In this paper, we consider the problem of constructing personalized policies using logged data when there are missing values in the attributes of features in both training and test data. The goal is to recommend an action (treatment) when $\Xt$, a degraded version of $\Xb$ with missing values, is observed. We consider three strategies for dealing with missingness. In particular, we introduce the \textit{conservative strategy} where the policy is designed to safely handle the uncertainty due to missingness. To implement this strategy, we need to estimate the posterior distribution $p(\Xb|\Xt)$, for which we use a variational autoencoder. In particular, our method is based on partial variational autoencoders (PVAE), which are designed to capture the underlying structure of features with missing values.

LG Jun 24, 2020
Beyond Grids: Multi-objective Bayesian Optimization With Adaptive Discretization

Andi Nika, Sepehr Elahi, Çağın Ararat et al.

We consider the problem of optimizing a vector-valued objective function $\boldsymbol{f}$ sampled from a Gaussian Process (GP) whose index set is a well-behaved, compact metric space $({\cal X},d)$ of designs. We assume that $\boldsymbol{f}$ is not known beforehand and that evaluating $\boldsymbol{f}$ at design $x$ results in a noisy observation of $\boldsymbol{f}(x)$. Since identifying the Pareto optimal designs via exhaustive search is infeasible when the cardinality of ${\cal X}$ is large, we propose an algorithm, called Adaptive $\boldsymbol{ε}$-PAL, that exploits the smoothness of the GP-sampled function and the structure of $({\cal X},d)$ to learn fast. In essence, Adaptive $\boldsymbol{ε}$-PAL employs a tree-based adaptive discretization technique to identify an $\boldsymbol{ε}$-accurate Pareto set of designs in as few evaluations as possible. We provide both information-type and metric dimension-type bounds on the sample complexity of $\boldsymbol{ε}$-accurate Pareto set identification. We also experimentally show that our algorithm outperforms other Pareto set identification methods.

LG Jul 26, 2019
Lexicographic Multiarmed Bandit

Alihan Hüyük, Cem Tekin

We consider a multiobjective multiarmed bandit problem with lexicographically ordered objectives. In this problem, the goal of the learner is to select arms that are lexicographic optimal as much as possible without knowing the arm reward distributions beforehand. We capture this goal by defining a multidimensional form of regret that measures the loss of the learner due to not selecting lexicographic optimal arms, and then, consider two settings where the learner has prior information on the expected arm rewards. In the first setting, the learner only knows for each objective the lexicographic optimal expected reward. In the second setting, it only knows for each objective near-lexicographic optimal expected rewards. For both settings we prove that the learner achieves expected regret uniformly bounded in time. The algorithm we propose for the second setting also attains bounded regret for the multiarmed bandit with satisficing objectives. In addition, we also consider the harder prior-free case, and show that the learner can still achieve sublinear in time gap-free regret. Finally, we experimentally evaluate performance of the proposed algorithms in a variety of multiobjective learning problems.
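As a small illustration of what "lexicographic optimal" means when the expected reward vectors are known, the sketch below picks the arm that is best in the first objective and breaks ties using later objectives. The function name and the reward vectors are hypothetical, and the sketch does not cover the learning problem or the regret notion studied in the paper.

```python
def lexicographic_best(expected_rewards):
    """Index of the arm whose expected reward vector is lexicographically largest.

    Python compares tuples lexicographically, so the first objective decides
    and each later objective is consulted only to break ties in earlier ones.
    """
    return max(range(len(expected_rewards)), key=lambda i: tuple(expected_rewards[i]))


# Hypothetical expected rewards: (objective 1, objective 2) per arm.
arms = [(0.5, 0.9), (0.7, 0.1), (0.7, 0.4)]
best_arm = lexicographic_best(arms)
print(best_arm)
```

Arm 0 has the highest second objective but loses on the first; arms 1 and 2 tie on the first objective, so the second objective selects arm 2.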

LG Jul 7, 2019
Thompson Sampling for Combinatorial Network Optimization in Unknown Environments

Alihan Hüyük, Cem Tekin

Influence maximization, adaptive routing, and dynamic spectrum allocation all require choosing the right action from a large set of alternatives. Thanks to the advances in combinatorial optimization, these and many similar problems can be efficiently solved given an environment with known stochasticity. In this paper, we take this one step further and focus on combinatorial optimization in unknown environments. We consider a very general learning framework called combinatorial multi-armed bandit with probabilistically triggered arms and a very powerful Bayesian algorithm called Combinatorial Thompson Sampling (CTS). Under the semi-bandit feedback model and assuming access to an oracle without knowing the expected base arm outcomes beforehand, we show that when the expected reward is Lipschitz continuous in the expected base arm outcomes, CTS achieves $O(\sum_{i =1}^m\log T/(p_iΔ_i))$ regret and $O(\max\{\mathbb{E}[m\sqrt{T\log T/p^*}],\mathbb{E}[m^2/p^*]\})$ Bayesian regret, where $m$ denotes the number of base arms, $p_i$ and $Δ_i$ denote the minimum non-zero triggering probability and the minimum suboptimality gap of base arm $i$ respectively, $T$ denotes the time horizon, and $p^*$ denotes the overall minimum non-zero triggering probability. We also show that when the expected reward satisfies the triggering probability modulated Lipschitz continuity, CTS achieves $O(\max\{m\sqrt{T\log T},m^2\})$ Bayesian regret, and when triggering probabilities are non-zero for all base arms, CTS achieves $O(1/p^*\log(1/p^*))$ regret independent of the time horizon. Finally, we numerically compare CTS with algorithms based on upper confidence bounds in several networking problems and show that CTS outperforms these algorithms by at least an order of magnitude in the majority of cases.

LG Jul 1, 2019
Exploiting Relevance for Online Decision-Making in High-Dimensions

Eralp Turgay, Cem Bulucu, Cem Tekin

Many sequential decision-making tasks require choosing at each decision step the right action out of a vast set of possibilities by extracting actionable intelligence from high-dimensional data streams. Most of the time, the high dimensionality of actions and data makes learning the optimal actions by traditional learning methods impracticable. In this work, we investigate how to discover and leverage sparsity in actions and data to enable fast learning. As our learning model, we consider a structured contextual multi-armed bandit (CMAB) with high-dimensional arm (action) and context (data) sets, where the rewards depend only on a few relevant dimensions of the joint context-arm set, possibly in a non-linear way. We depart from the prior work by assuming a high-dimensional, continuum set of arms, and allow relevant context dimensions to vary for each arm. We propose a new online learning algorithm called {\em CMAB with Relevance Learning} (CMAB-RL) and prove that its time-averaged regret asymptotically goes to zero when the expected reward varies smoothly in contexts and arms. CMAB-RL enjoys a substantially improved regret bound compared to classical CMAB algorithms whose regrets depend on dimensions $d_x$ and $d_a$ of the context and arm sets. Importantly, we show that when the learner has prior knowledge on sparsity, given in terms of upper bounds $\overline{d}_x$ and $\overline{d}_a$ on the number of relevant dimensions, then CMAB-RL achieves $\tilde{O}(T^{1-1/(2+2\overline{d}_x +\overline{d}_a)})$ regret. Finally, we illustrate how CMAB algorithms can be used for optimal personalized blood glucose control in type 1 diabetes mellitus patients, and show that CMAB-RL outperforms other contextual MAB algorithms in this task.

LG May 2, 2019
Group Retention when Using Machine Learning in Sequential Decision Making: the Interplay between User Dynamics and Fairness

Xueru Zhang, Mohammad Mahdi Khalili, Cem Tekin et al.

Machine Learning (ML) models trained on data from multiple demographic groups can inherit representation disparity (Hashimoto et al., 2018) that may exist in the data: the model may be less favorable to groups contributing less to the training process; this in turn can degrade population retention in these groups over time, and exacerbate representation disparity in the long run. In this study, we seek to understand the interplay between ML decisions and the underlying group representation, how they evolve in a sequential framework, and how the use of fairness criteria plays a role in this process. We show that the representation disparity can easily worsen over time under a natural user dynamics (arrival and departure) model when decisions are made based on a commonly used objective and fairness criteria, resulting in some groups diminishing entirely from the sample pool in the long run. It highlights the fact that fairness criteria have to be defined while taking into consideration the impact of decisions on user dynamics. Toward this end, we explain how a proper fairness criterion can be selected based on a general user dynamics model.

LG Sep 7, 2018
Analysis of Thompson Sampling for Combinatorial Multi-armed Bandit with Probabilistically Triggered Arms

Alihan Hüyük, Cem Tekin

We analyze the regret of combinatorial Thompson sampling (CTS) for the combinatorial multi-armed bandit with probabilistically triggered arms under the semi-bandit feedback setting. We assume that the learner has access to an exact optimization oracle but does not know the expected base arm outcomes beforehand. When the expected reward function is Lipschitz continuous in the expected base arm outcomes, we derive an $O(\sum_{i =1}^m \log T / (p_i Δ_i))$ regret bound for CTS, where $m$ denotes the number of base arms, $p_i$ denotes the minimum non-zero triggering probability of base arm $i$, and $Δ_i$ denotes the minimum suboptimality gap of base arm $i$. We also compare CTS with combinatorial upper confidence bound (CUCB) via numerical experiments on a cascading bandit problem.
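A minimal sketch of combinatorial Thompson sampling with semi-bandit feedback is given below. It makes several simplifying assumptions that are not from the abstract: base arm outcomes are Bernoulli, every selected arm is triggered with probability one (so there is no probabilistic triggering), the oracle simply returns the `k` arms with the largest sampled means, and the function name and arm means are hypothetical.

```python
import random


def cts_top_k(true_means, k, horizon, seed=0):
    """Run a simplified CTS and return how often each base arm was played."""
    rng = random.Random(seed)
    m = len(true_means)
    successes = [1] * m  # Beta(1, 1) prior on each base arm's mean
    failures = [1] * m
    plays = [0] * m
    for _ in range(horizon):
        # Sample one plausible mean per base arm from its Beta posterior.
        theta = [rng.betavariate(successes[i], failures[i]) for i in range(m)]
        # Oracle (assumed): the best action for the sampled means is the top-k arms.
        action = sorted(range(m), key=lambda i: theta[i], reverse=True)[:k]
        # Semi-bandit feedback: observe the outcome of every arm in the action.
        for i in action:
            outcome = 1 if rng.random() < true_means[i] else 0
            successes[i] += outcome
            failures[i] += 1 - outcome
            plays[i] += 1
    return plays


plays = cts_top_k(true_means=[0.9, 0.8, 0.1, 0.1], k=2, horizon=2000)
print(plays)
```

After an initial exploration phase, the two arms with the highest means should account for almost all plays, since the posteriors of the weak arms concentrate on low values and their samples rarely enter the top `k`.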

LG Mar 11, 2018
Combinatorial Multi-Objective Multi-Armed Bandit Problem

Doruk Öner, Altuğ Karakurt, Atilla Eryılmaz et al.

In this paper, we introduce the COmbinatorial Multi-Objective Multi-Armed Bandit (COMO-MAB) problem that captures the challenges of combinatorial and multi-objective online learning simultaneously. In this setting, the goal of the learner is to choose an action at each time, whose reward vector is a linear combination of the reward vectors of the arms in the action, to learn the set of super Pareto optimal actions, which includes the Pareto optimal actions and actions that become Pareto optimal after adding an arbitrarily small positive number to their expected reward vectors. We define the Pareto regret performance metric and propose a fair learning algorithm whose Pareto regret is $O(N L^3 \log T)$, where $T$ is the time horizon, $N$ is the number of arms, and $L$ is the maximum number of arms in an action. We show that COMO-MAB has a wide range of applications, including recommending bundles of items to users and network routing, and focus on a resource-allocation application for multi-user communication in the presence of multidimensional performance metrics, where we show that our algorithm outperforms existing MAB algorithms.

ML Mar 11, 2018
Multi-objective Contextual Bandit Problem with Similarity Information

Eralp Turğay, Doruk Öner, Cem Tekin

In this paper we propose the multi-objective contextual bandit problem with similarity information. This problem extends the classical contextual bandit problem with similarity information by introducing multiple and possibly conflicting objectives. Since the best arm in each objective can be different given the context, learning the best arm based on a single objective can jeopardize the rewards obtained from the other objectives. In order to evaluate the performance of the learner in this setup, we use a performance metric called the contextual Pareto regret. Essentially, the contextual Pareto regret is the sum of the distances of the arms chosen by the learner to the context dependent Pareto front. For this problem, we develop a new online learning algorithm called Pareto Contextual Zooming (PCZ), which exploits the idea of contextual zooming to learn the arms that are close to the Pareto front for each observed context by adaptively partitioning the joint context-arm set according to the observed rewards and locations of the context-arm pairs selected in the past. Then, we prove that PCZ achieves $\tilde O (T^{(1+d_p)/(2+d_p)})$ Pareto regret where $d_p$ is the Pareto zooming dimension that depends on the size of the set of near-optimal context-arm pairs. Moreover, we show that this regret bound is nearly optimal by providing an almost matching $Ω(T^{(1+d_p)/(2+d_p)})$ lower bound.

LGAug 18, 2017
Multi-objective Contextual Multi-armed Bandit with a Dominant Objective

Cem Tekin, Eralp Turgay

In this paper, we propose a new multi-objective contextual multi-armed bandit (MAB) problem with two objectives, where one of the objectives dominates the other objective. Unlike single-objective MAB problems in which the learner obtains a random scalar reward for each arm it selects, in the proposed problem, the learner obtains a random reward vector, where each component of the reward vector corresponds to one of the objectives and the distribution of the reward depends on the context that is provided to the learner at the beginning of each round. We call this problem contextual multi-armed bandit with a dominant objective (CMAB-DO). In CMAB-DO, the goal of the learner is to maximize its total reward in the non-dominant objective while ensuring that it maximizes its total reward in the dominant objective. In this case, the optimal arm given a context is the one that maximizes the expected reward in the non-dominant objective among all arms that maximize the expected reward in the dominant objective. First, we show that the optimal arm lies in the Pareto front. Then, we propose the multi-objective contextual multi-armed bandit algorithm (MOC-MAB), and define two performance measures: the 2-dimensional (2D) regret and the Pareto regret. We show that both the 2D regret and the Pareto regret of MOC-MAB are sublinear in the number of rounds. We also compare the performance of the proposed algorithm with other state-of-the-art methods in synthetic and real-world datasets. The proposed model and the algorithm have a wide range of real-world applications that involve multiple and possibly conflicting objectives ranging from wireless communication to medical diagnosis and recommender systems.
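
The definition of the optimal arm in CMAB-DO is a lexicographic rule, which a small example may clarify. The sketch below assumes the expected rewards are known for one fixed context; the function name `optimal_arm`, the tolerance `tol`, and the example values are hypothetical and are not taken from the paper, and the code does not implement MOC-MAB itself.

```python
from typing import List, Tuple


def optimal_arm(means: List[Tuple[float, float]], tol: float = 1e-12) -> int:
    """Pick the lexicographically optimal arm for one context.

    means[i] = (expected reward in the dominant objective,
                expected reward in the non-dominant objective).
    """
    best_dominant = max(m[0] for m in means)
    # Arms that maximize the dominant objective (up to a numerical tolerance).
    candidates = [i for i, m in enumerate(means) if best_dominant - m[0] <= tol]
    # Among those, maximize the non-dominant objective.
    return max(candidates, key=lambda i: means[i][1])


# Hypothetical expected rewards for three arms.
means = [(0.8, 0.2), (0.8, 0.6), (0.5, 0.9)]
print(optimal_arm(means))  # arm 1: ties arm 0 in the dominant objective, better in the other
```

Arm 2 has the largest non-dominant reward but is excluded because it does not maximize the dominant objective. The selected arm cannot be dominated by any other arm, which is consistent with the abstract's claim that the optimal arm lies in the Pareto front.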

LGJul 24, 2017
Combinatorial Multi-armed Bandit with Probabilistically Triggered Arms: A Case with Bounded Regret

A. Ömer Sarıtaç, Cem Tekin

In this paper, we study the combinatorial multi-armed bandit problem (CMAB) with probabilistically triggered arms (PTAs). Under the assumption that the arm triggering probabilities (ATPs) are positive for all arms, we prove that a class of upper confidence bound (UCB) policies, named Combinatorial UCB with exploration rate $κ$ (CUCB-$κ$), and Combinatorial Thompson Sampling (CTS), which estimates the expected states of the arms via Thompson sampling, achieve bounded regret. In addition, we prove that CUCB-$0$ and CTS incur $O(\sqrt{T})$ gap-independent regret. These results improve the results in previous works, which show $O(\log T)$ gap-dependent and $O(\sqrt{T\log T})$ gap-independent regrets, respectively, under no assumptions on the ATPs. Then, we numerically evaluate the performance of CUCB-$κ$ and CTS in a real-world movie recommendation problem, where the actions correspond to recommending a set of movies, the arms correspond to the edges between the movies and the users, and the goal is to maximize the total number of users that are attracted by at least one movie. Our numerical results complement our theoretical findings on bounded regret. Apart from this problem, our results also directly apply to the online influence maximization (OIM) problem studied in numerous prior works.

LGMay 10, 2017
Context-Aware Hierarchical Online Learning for Performance Maximization in Mobile Crowdsourcing

Sabrina Klos, Cem Tekin, Mihaela van der Schaar et al.

In mobile crowdsourcing (MCS), mobile users accomplish outsourced human intelligence tasks. MCS requires an appropriate task assignment strategy, since different workers may have different performance in terms of acceptance rate and quality. Task assignment is challenging, since a worker's performance (i) may fluctuate, depending on both the worker's current personal context and the task context, (ii) is not known a priori, but has to be learned over time. Moreover, learning context-specific worker performance requires access to context information, which may not be available at a central entity due to communication overhead or privacy concerns. Additionally, evaluating worker performance might require costly quality assessments. In this paper, we propose a context-aware hierarchical online learning algorithm addressing the problem of performance maximization in MCS. In our algorithm, a local controller (LC) in the mobile device of a worker regularly observes the worker's context, her/his decisions to accept or decline tasks and the quality in completing tasks. Based on these observations, the LC regularly estimates the worker's context-specific performance. The mobile crowdsourcing platform (MCSP) then selects workers based on performance estimates received from the LCs. This hierarchical approach enables the LCs to learn context-specific worker performance and it enables the MCSP to select suitable workers. In addition, our algorithm preserves worker context locally, and it keeps the number of required quality assessments low. We prove that our algorithm converges to the optimal task assignment strategy. Moreover, the algorithm outperforms simpler task assignment strategies in experiments based on synthetic and real data.

LGMay 21, 2016
Gambler's Ruin Bandit Problem

Nima Akbarzadeh, Cem Tekin

In this paper, we propose a new multi-armed bandit problem called the Gambler's Ruin Bandit Problem (GRBP). In the GRBP, the learner proceeds in a sequence of rounds, where each round is a Markov Decision Process (MDP) with two actions (arms): a continuation action that moves the learner randomly over the state space around the current state; and a terminal action that moves the learner directly into one of the two terminal states (goal and dead-end state). The current round ends when a terminal state is reached, and the learner incurs a positive reward only when the goal state is reached. The objective of the learner is to maximize its long-term reward (expected number of times the goal state is reached), without having any prior knowledge on the state transition probabilities. We first prove a result on the form of the optimal policy for the GRBP. Then, we define the regret of the learner with respect to an omnipotent oracle, which acts optimally in each round, and prove that it increases logarithmically over rounds. We also identify a condition under which the learner's regret is bounded. A potential application of the GRBP is optimal medical treatment assignment, in which the continuation action corresponds to a conservative treatment and the terminal action corresponds to a risky treatment such as surgery.
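
For intuition about the continuation action, the classical gambler's ruin formula gives the probability of reaching the goal before the dead-end state. The sketch below assumes a simple random walk on states 0, ..., G that moves up by one with probability `p` and down by one otherwise, with 0 as the dead-end state and G as the goal; the abstract does not specify the transition model, so this walk and the names `goal_probability`, `s`, `G` and `p` are assumptions made for illustration.

```python
def goal_probability(s: int, G: int, p: float) -> float:
    """Probability that a +/-1 random walk started at s hits G before 0.

    Assumes each step moves up with probability p and down with probability 1 - p.
    """
    if not 0 <= s <= G:
        raise ValueError("s must lie in [0, G]")
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if s == 0:
        return 0.0
    if s == G:
        return 1.0
    q = 1.0 - p
    if abs(p - q) < 1e-12:
        # Symmetric walk: the hitting probability is linear in the start state.
        return s / G
    r = q / p
    return (1.0 - r**s) / (1.0 - r**G)


print(goal_probability(3, 10, 0.5))  # symmetric case, equals 3/10
print(goal_probability(1, 2, 0.6))   # one step from the goal, approximately 0.6
```

In the learning problem the learner does not know `p`, so a quantity like this would have to be estimated from observed transitions before it could be compared with the success probability of the terminal action.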

LGDec 23, 2015
Adaptive Ensemble Learning with Confidence Bounds

Cem Tekin, Jinsung Yoon, Mihaela van der Schaar

Extracting actionable intelligence from distributed, heterogeneous, correlated and high-dimensional data sources requires run-time processing and learning both locally and globally. In the last decade, a large number of meta-learning techniques have been proposed in which local learners make online predictions based on their locally-collected data instances, and feed these predictions to an ensemble learner, which fuses them and issues a global prediction. However, most of these works do not provide performance guarantees or, when they do, these guarantees are asymptotic. None of these existing works provide confidence estimates about the issued predictions or rate of learning guarantees for the ensemble learner. In this paper, we provide a systematic ensemble learning method called Hedged Bandits, which comes with both long run (asymptotic) and short run (rate of learning) performance guarantees. Moreover, our approach yields performance guarantees with respect to the optimal local prediction strategy, and is also able to adapt its predictions in a data-driven manner. We illustrate the performance of Hedged Bandits in the context of medical informatics and show that it outperforms numerous online and offline ensemble learning methods.

LGAug 4, 2015
Episodic Multi-armed Bandits

Cem Tekin, Mihaela van der Schaar

We introduce a new class of reinforcement learning methods referred to as episodic multi-armed bandits (eMAB). In eMAB the learner proceeds in episodes, each composed of several steps, in which it chooses an action and observes a feedback signal. Moreover, in each step, it can take a special action, called the stop action, that ends the current episode. After the stop action is taken, the learner collects a terminal reward, and observes the costs and terminal rewards associated with each step of the episode. The goal of the learner is to maximize its cumulative gain (i.e., the terminal reward minus costs) over all episodes by learning to choose the best sequence of actions based on the feedback. First, we define an oracle benchmark, which sequentially selects the actions that maximize the expected immediate gain. Then, we propose our online learning algorithm, named FeedBack Adaptive Learning (FeedBAL), and prove that its regret with respect to the benchmark is bounded with high probability and increases logarithmically in expectation. Moreover, the regret only has polynomial dependence on the number of steps, actions and states. eMAB can be used to model applications that involve humans in the loop, ranging from personalized medical screening to personalized web-based education, where sequences of actions are taken in each episode, and optimal behavior requires adapting the chosen actions based on the feedback.

LGMar 29, 2015
Global Bandits

Onur Atan, Cem Tekin, Mihaela van der Schaar

Multi-armed bandits (MAB) model sequential decision making problems, in which a learner sequentially chooses arms with unknown reward distributions in order to maximize its cumulative reward. Most of the prior work on MAB assumes that the reward distributions of the arms are independent. But in a wide variety of decision problems -- from drug dosage to dynamic pricing -- the expected rewards of different arms are correlated, so that selecting one arm provides information about the expected rewards of other arms as well. We propose and analyze a class of models of such decision problems, which we call global bandits. In the case in which rewards of all arms are deterministic functions of a single unknown parameter, we construct a greedy policy that achieves bounded regret, with a bound that depends on the single true parameter of the problem. Hence, this policy selects suboptimal arms only finitely many times with probability one. For this case we also obtain a bound on regret that is independent of the true parameter; this bound is sub-linear, with an exponent that depends on the informativeness of the arms. We also propose a variant of the greedy policy that achieves $\tilde{\mathcal{O}}(\sqrt{T})$ worst-case and $\mathcal{O}(1)$ parameter dependent regret. Finally, we perform experiments on dynamic pricing and show that the proposed algorithms achieve significant gains with respect to the well-known benchmarks.

MMFeb 7, 2015
Contextual Online Learning for Multimedia Content Aggregation

Cem Tekin, Mihaela van der Schaar

The last decade has witnessed a tremendous growth in the volume as well as the diversity of multimedia content generated by a multitude of sources (news agencies, social media, etc.). Faced with a variety of content choices, consumers are exhibiting diverse preferences for content; their preferences often depend on the context in which they consume content as well as various exogenous events. To satisfy the consumers' demand for such diverse content, multimedia content aggregators (CAs) have emerged which gather content from numerous multimedia sources. A key challenge for such systems is to accurately predict what type of content each of its consumers prefers in a certain context, and adapt these predictions to the evolving consumers' preferences, contexts and content characteristics. We propose a novel, distributed, online multimedia content aggregation framework, which gathers content generated by multiple heterogeneous producers to fulfill its consumers' demand for content. Since both the multimedia content characteristics and the consumers' preferences and contexts are unknown, the optimal content aggregation strategy is unknown a priori. Our proposed content aggregation algorithm is able to learn online what content to gather and how to match content and users by exploiting similarities between consumer types. We prove bounds for our proposed learning algorithms that guarantee both the accuracy of the predictions as well as the learning speed. Importantly, our algorithms operate efficiently even when feedback from consumers is missing or content and preferences evolve over time. Illustrative results highlight the merits of the proposed content aggregation system in a variety of settings.

LGFeb 5, 2015
RELEAF: An Algorithm for Learning and Exploiting Relevance

Cem Tekin, Mihaela van der Schaar

Recommender systems, medical diagnosis, network security, etc., require on-going learning and decision-making in real time. These -- and many others -- represent perfect examples of the opportunities and difficulties presented by Big Data: the available information often arrives from a variety of sources and has diverse features so that learning from all the sources may be valuable but integrating what is learned is subject to the curse of dimensionality. This paper develops and analyzes algorithms that allow efficient learning and decision-making while avoiding the curse of dimensionality. We formalize the information available to the learner/decision-maker at a particular time as a context vector which the learner should consider when taking actions. In general the context vector is very high dimensional, but in many settings, the most relevant information is embedded into only a few relevant dimensions. If these relevant dimensions were known in advance, the problem would be simple -- but they are not. Moreover, the relevant dimensions may be different for different actions. Our algorithm learns the relevant dimensions for each action, and makes decisions based on what it has learned. Formally, we build on the structure of a contextual multi-armed bandit by adding and exploiting a relevance relation. We prove a general regret bound for our algorithm whose time order depends only on the maximum number of relevant dimensions among all the actions, which in the special case where the relevance relation is single-valued (a function), reduces to $\tilde{O}(T^{2(\sqrt{2}-1)})$; in the absence of a relevance relation, the best known contextual bandit algorithms achieve regret $\tilde{O}(T^{(D+1)/(D+2)})$, where $D$ is the full dimension of the context vector.
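
A quick numerical comparison of the two time exponents quoted above shows when the relevance-based bound is the smaller one. This sketch compares only the exponents $2(\sqrt{2}-1)$ and $(D+1)/(D+2)$ and ignores the logarithmic factors and constants hidden by the $\tilde{O}$ notation; the function names are hypothetical.

```python
import math


def relevance_exponent() -> float:
    """Time exponent of the regret bound in the single-valued relevance case."""
    return 2 * (math.sqrt(2) - 1)


def full_context_exponent(D: int) -> float:
    """Time exponent (D + 1) / (D + 2) when all D context dimensions are used."""
    return (D + 1) / (D + 2)


def smallest_dimension_where_relevance_wins() -> int:
    """Smallest D for which the relevance-based exponent is strictly smaller."""
    D = 1
    while full_context_exponent(D) <= relevance_exponent():
        D += 1
    return D


print(round(relevance_exponent(), 4))             # about 0.8284
print(smallest_dimension_where_relevance_wins())  # 4, since 4/5 < 0.8284 < 5/6
```

In terms of the exponent alone, the relevance-based bound is therefore the tighter one once the context has four or more dimensions, and the advantage widens as $D$ grows because $(D+1)/(D+2)$ approaches 1.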

ITNov 13, 2014
Jamming Bandits

SaiDhiraj Amuru, Cem Tekin, Mihaela van der Schaar et al.

Can an intelligent jammer learn and adapt to unknown environments in an electronic warfare-type scenario? In this paper, we answer this question in the positive, by developing a cognitive jammer that adaptively and optimally disrupts the communication between a victim transmitter-receiver pair. We formalize the problem using a novel multi-armed bandit framework where the jammer can choose various physical layer parameters such as the signaling scheme, power level and the on-off/pulsing duration in an attempt to obtain power efficient jamming strategies. We first present novel online learning algorithms to maximize the jamming efficacy against static transmitter-receiver pairs and prove that our learning algorithm converges to the optimal (in terms of the error rate inflicted at the victim and the energy used) jamming strategy. Even more importantly, we prove that the rate of convergence to the optimal jamming strategy is sub-linear, i.e. the learning is fast in comparison to existing reinforcement learning algorithms, which is particularly important in dynamically changing wireless environments. Also, we characterize the performance of the proposed bandit-based learning algorithm against multiple static and adaptive transmitter-receiver pairs.

LGOct 29, 2014
Global Bandits with Holder Continuity

Onur Atan, Cem Tekin, Mihaela van der Schaar

Standard Multi-Armed Bandit (MAB) problems assume that the arms are independent. However, in many application scenarios, the information obtained by playing an arm provides information about the remainder of the arms. Hence, in such applications, this informativeness can and should be exploited to enable faster convergence to the optimal solution. In this paper, we introduce and formalize the Global MAB (GMAB), in which arms are globally informative through a global parameter, i.e., choosing an arm reveals information about all the arms. We propose a greedy policy for the GMAB which always selects the arm with the highest estimated expected reward, and prove that it achieves bounded parameter-dependent regret. Hence, this policy selects suboptimal arms only finitely many times, and after a finite number of initial time steps, the optimal arm is selected in all of the remaining time steps with probability one. In addition, we also study how the informativeness of the arms about each other's rewards affects the speed of learning. Specifically, we prove that the parameter-free (worst-case) regret is sublinear in time, and decreases with the informativeness of the arms. We also prove a sublinear in time Bayesian risk bound for the GMAB which reduces to the well-known Bayesian risk bound for linearly parameterized bandits when the arms are fully informative. GMABs have applications ranging from drug and treatment discovery to dynamic pricing.

SISep 26, 2013
Distributed Online Learning in Social Recommender Systems

Cem Tekin, Simpson Zhang, Mihaela van der Schaar

In this paper, we consider decentralized sequential decision making in distributed online recommender systems, where items are recommended to users based on their search query as well as their specific background including history of bought items, gender and age, all of which comprise the context information of the user. In contrast to centralized recommender systems, in which there is a single centralized seller who has access to the complete inventory of items as well as the complete record of sales and user information, in decentralized recommender systems each seller/learner only has access to the inventory of items and user information for its own products and not the products and user information of other sellers, but can get commission if it sells an item of another seller. Therefore the sellers must distributedly find out for an incoming user which items to recommend (from the set of own items or items of another seller), in order to maximize the revenue from own sales and commissions. We formulate this problem as a cooperative contextual bandit problem, analytically bound the performance of the sellers compared to the best recommendation strategy given the complete realization of user arrivals and the inventory of items, as well as the context-dependent purchase probabilities of each item, and verify our results via numerical examples on a distributed data set adapted based on Amazon data. We evaluate the dependence of the performance of a seller on the inventory of items the seller has, the number of connections it has with the other sellers, and the commissions which the seller gets by selling items of other sellers to its users.

LGAug 21, 2013
Distributed Online Learning via Cooperative Contextual Bandits

Cem Tekin, Mihaela van der Schaar

In this paper we propose a novel framework for decentralized, online learning by many learners. At each moment of time, an instance characterized by a certain context may arrive to each learner; based on the context, the learner can select one of its own actions (which gives a reward and provides information) or request assistance from another learner. In the latter case, the requester pays a cost and receives the reward but the provider learns the information. In our framework, learners are modeled as cooperative contextual bandits. Each learner seeks to maximize the expected reward from its arrivals, which involves trading off the reward received from its own actions, the information learned from its own actions, the reward received from the actions requested of others and the cost paid for these actions - taking into account what it has learned about the value of assistance from each other learner. We develop distributed online learning algorithms and provide analytic bounds to compare the efficiency of these with algorithms with the complete knowledge (oracle) benchmark (in which the expected reward of every action in every context is known by every learner). Our estimates show that regret - the loss incurred by the algorithm - is sublinear in time. Our theoretical framework can be used in many practical applications including Big Data mining, event detection in surveillance sensor networks and distributed online recommendation systems.

LGAug 21, 2013
Decentralized Online Big Data Classification - a Bandit Framework

Cem Tekin, Mihaela van der Schaar

Distributed, online data mining systems have emerged as a result of applications requiring analysis of large amounts of correlated and high-dimensional data produced by multiple distributed data sources. We propose a distributed online data classification framework where data is gathered by distributed data sources and processed by a heterogeneous set of distributed learners which learn online, at run-time, how to classify the different data streams either by using their locally available classification functions or by helping each other by classifying each other's data. Importantly, since the data is gathered at different locations, sending the data to another learner to process incurs additional costs such as delays, and hence this is beneficial only if the benefits obtained from a better classification exceed the costs. We assume that the classification functions available to each processing element are fixed, but their prediction accuracies for various types of incoming data are unknown and can change dynamically over time, and thus they need to be learned online. We model the problem of joint classification by the distributed and heterogeneous learners from multiple data sources as a distributed contextual bandit problem where each data instance is characterized by a specific context. We develop distributed online learning algorithms and prove that they have sublinear regret. Compared to prior work in distributed online data mining, our work is the first to provide analytic regret results characterizing the performance of the proposed algorithms.

LGJul 2, 2013
Distributed Online Big Data Classification Using Context Information

Cem Tekin, Mihaela van der Schaar

Distributed, online data mining systems have emerged as a result of applications requiring analysis of large amounts of correlated and high-dimensional data produced by multiple distributed data sources. We propose a distributed online data classification framework where data is gathered by distributed data sources and processed by a heterogeneous set of distributed learners which learn online, at run-time, how to classify the different data streams either by using their locally available classification functions or by helping each other by classifying each other's data. Importantly, since the data is gathered at different locations, sending the data to another learner to process incurs additional costs such as delays, and hence this is beneficial only if the benefits obtained from a better classification exceed the costs. We model the problem of joint classification by the distributed and heterogeneous learners from multiple data sources as a distributed contextual bandit problem where each data instance is characterized by a specific context. We develop a distributed online learning algorithm and prove that it has sublinear regret. Compared to prior work in distributed online data mining, our work is the first to provide analytic regret results characterizing the performance of the proposed algorithm.

LGMay 15, 2013
Online Learning in a Contract Selection Problem

Cem Tekin, Mingyan Liu

In an online contract selection problem there is a seller that offers a set of contracts to sequentially arriving buyers whose types are drawn from an unknown distribution. If there exists a profitable contract for the buyer in the offered set, i.e., a contract with payoff higher than the payoff of not accepting any contracts, the buyer chooses the contract that maximizes its payoff. In this paper we consider the online contract selection problem to maximize the seller's profit. Assuming that a structural property called ordered preferences holds for the buyer's payoff function, we propose online learning algorithms that have sub-linear regret with respect to the best set of contracts given the distribution over the buyer's type. This problem has many applications including spectrum contracts, wireless service provider data plans and recommendation systems.

LGOct 19, 2012
Online Learning in Decentralized Multiuser Resource Sharing Problems

Cem Tekin, Mingyan Liu

In this paper, we consider the general scenario of resource sharing in a decentralized system when the resource rewards/qualities are time-varying and unknown to the users, and the use of the same resource by multiple users reduces the quality each of them receives. First, we consider a user-independent reward model with no communication between the users, where a user gets feedback about the congestion level in the resource it uses. Second, we consider user-specific rewards and allow costly communication between the users. The users have a cooperative goal of achieving the highest system utility. There are multiple obstacles in achieving this goal, such as the decentralized nature of the system, unknown resource qualities, and communication, computation and switching costs. We propose distributed learning algorithms with logarithmic regret with respect to the optimal allocation. Our logarithmic regret result holds under both i.i.d. and Markovian reward models, as well as under communication, computation and switching costs.