Setareh Maghsudi

h-index 17

37 papers

435 citations

Novelty 52%

AI Score 57

Ranked #6,138 of 194,257 authors (top 3%) · #1,650 in LG (top 4%)

37 Papers

6.6 · LG · Jun 29, 2023 · Code

Eigensubspace of Temporal-Difference Dynamics and How It Improves Value Approximation in Reinforcement Learning

Qiang He, Tianyi Zhou, Meng Fang et al. · uw

We propose a novel value approximation method, namely Eigensubspace Regularized Critic (ERC) for deep reinforcement learning (RL). ERC is motivated by an analysis of the dynamics of Q-value approximation error in the Temporal-Difference (TD) method, which follows a path defined by the 1-eigensubspace of the transition kernel associated with the Markov Decision Process (MDP). It reveals a fundamental property of TD learning that has remained unused in previous deep RL approaches. In ERC, we propose a regularizer that guides the approximation error toward the 1-eigensubspace, resulting in a more efficient and stable path of value approximation. Moreover, we theoretically prove the convergence of the ERC method. Besides, theoretical analysis and experiments demonstrate that ERC effectively reduces the variance of value functions. Among 26 tasks in the DMControl benchmark, ERC outperforms state-of-the-art methods on 20. It also shows significant advantages in Q-value approximation and variance reduction. Our code is available at https://sites.google.com/view/erc-ecml23/.

13.7 · LG · Feb 7, 2023 · Code

Robustness Implies Fairness in Causal Algorithmic Recourse

Ahmad-Reza Ehyaei, Amir-Hossein Karimi, Bernhard Schölkopf et al. · eth-zurich

Algorithmic recourse aims to disclose the inner workings of the black-box decision process in situations where decisions have significant consequences, by providing recommendations to empower beneficiaries to achieve a more favorable outcome. To ensure an effective remedy, suggested interventions must not only be low-cost but also robust and fair. This goal is accomplished by providing similar explanations to individuals who are alike. This study explores the concept of individual fairness and adversarial robustness in causal algorithmic recourse and addresses the challenge of achieving both. To resolve the challenges, we propose a new framework for defining adversarially robust recourse. The new setting views the protected feature as a pseudometric and demonstrates that individual fairness is a special case of adversarial robustness. Finally, we introduce the fair robust recourse problem to achieve both desirable properties and show how it can be satisfied both theoretically and empirically.

3.8 · LG · Aug 31, 2023

Federated Learning in UAV-Enhanced Networks: Joint Coverage and Convergence Time Optimization

Mariam Yahya, Setareh Maghsudi, Slawomir Stanczak

Federated learning (FL) involves several devices that collaboratively train a shared model without transferring their local data. FL reduces the communication overhead, making it a promising learning method in UAV-enhanced wireless networks with scarce energy resources. Despite the potential, implementing FL in UAV-enhanced networks is challenging, as conventional UAV placement methods that maximize coverage increase the FL delay significantly. Moreover, the uncertainty and lack of a priori information about crucial variables, such as channel quality, exacerbate the problem. In this paper, we first analyze the statistical characteristics of a UAV-enhanced wireless sensor network (WSN) with energy harvesting. We then develop a model and solution based on the multi-objective multi-armed bandit theory to maximize the network coverage while minimizing the FL delay. Besides, we propose another solution that is particularly useful with large action sets and strict energy constraints at the UAVs. Our proposal uses a scalarized best-arm identification algorithm to find the optimal arms that maximize the ratio of the expected reward to the expected energy cost by sequentially eliminating one or more arms in each round. Then, we derive the upper bound on the error probability of our multi-objective and cost-aware algorithm. Numerical results show the effectiveness of our approach.

3.3 · IT · May 19, 2022

A Learning-Based Approach to Approximate Coded Computation

Navneet Agrawal, Yuqin Qiu, Matthias Frey et al.

Lagrange coded computation (LCC) is essential to solving problems about matrix polynomials in a coded distributed fashion; nevertheless, it can only solve the problems that are representable as matrix polynomials. In this paper, we propose AICC, an AI-aided learning approach that is inspired by LCC but also uses deep neural networks (DNNs). It is appropriate for coded computation of more general functions. Numerical simulations demonstrate the suitability of the proposed approach for the coded computation of different matrix functions that are often utilized in digital signal processing.

7.9 · AI · Sep 7, 2023 · Code

Learning of Generalizable and Interpretable Knowledge in Grid-Based Reinforcement Learning Environments

Manuel Eberhardinger, Johannes Maucher, Setareh Maghsudi

Understanding the interactions of agents trained with deep reinforcement learning is crucial for deploying agents in games or the real world. In the former, unreasonable actions confuse players. In the latter, that effect is even more significant, as unexpected behavior can cause accidents with potentially grave and long-lasting consequences for the involved individuals. In this work, we propose using program synthesis to imitate reinforcement learning policies after seeing a trajectory of the action sequence. Programs have the advantage that they are inherently interpretable and verifiable for correctness. We adapt the state-of-the-art program synthesis system DreamCoder for learning concepts in grid-based environments, specifically, a navigation task and two miniature versions of Atari games, Space Invaders and Asterix. By inspecting the generated libraries, we can make inferences about the concepts the black-box agent has learned and better understand the agent's behavior. We achieve the same by visualizing the agent's decision-making process for the imitated sequences. We evaluate our approach with different types of program synthesizers based on a search-only method, a neural-guided search, and a language model fine-tuned on code.

5.3 · LG · Feb 10, 2023 · Code

Piecewise-Stationary Multi-Objective Multi-Armed Bandit with Application to Joint Communications and Sensing

Amir Rezaei Balef, Setareh Maghsudi

We study a multi-objective multi-armed bandit problem in a dynamic environment. The problem portrays a decision-maker that sequentially selects an arm from a given set. If selected, each action produces a reward vector, where every element follows a piecewise-stationary Bernoulli distribution. The agent aims at choosing an arm among the Pareto optimal set of arms to minimize its regret. We propose a Pareto generic upper confidence bound (UCB)-based algorithm with change detection to solve this problem. By developing the essential inequalities for multi-dimensional spaces, we establish that our proposal guarantees a regret bound in the order of $γ_T\log(T/{γ_T})$ when the number of breakpoints $γ_T$ is known. Without this assumption, the regret bound of our algorithm is $γ_T\log(T)$. Finally, we formulate an energy-efficient waveform design problem in an integrated communication and sensing system as a toy example. Numerical experiments on the toy example and synthetic and real-world datasets demonstrate the efficiency of our policy compared to the current methods.
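
As a rough illustration of the "Pareto optimal set of arms" mentioned in this abstract, the following Python sketch filters a list of mean reward vectors down to the non-dominated ones. The function names `dominates` and `pareto_optimal_arms` and the sample vectors are hypothetical and chosen only for illustration; the sketch assumes the true means are known and does not implement the paper's UCB-based algorithm or its change detection.

```python
from typing import List, Sequence


def dominates(u: Sequence[float], v: Sequence[float]) -> bool:
    """Return True if u is at least as good as v in every objective
    and strictly better in at least one (larger is better)."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))


def pareto_optimal_arms(means: List[Sequence[float]]) -> List[int]:
    """Return the indices of arms whose mean reward vector is not
    dominated by the mean reward vector of any other arm."""
    return [
        i
        for i, u in enumerate(means)
        if not any(dominates(v, u) for j, v in enumerate(means) if j != i)
    ]


# Hypothetical two-objective example with four arms.
example_means = [(0.9, 0.1), (0.5, 0.5), (0.4, 0.4), (0.1, 0.9)]
front = pareto_optimal_arms(example_means)  # arm 2 is dominated by arm 1
```

In a bandit setting the true means are unknown, so a learner would apply a comparison of this kind to estimates (for example, upper confidence bounds) rather than to the exact vectors used here.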

5.8 · LG · Dec 25, 2022

Linear Combinatorial Semi-Bandit with Causally Related Rewards

Behzad Nourani-Koliji, Saeed Ghoorchian, Setareh Maghsudi

In a sequential decision-making problem, having a structural dependency amongst the reward distributions associated with the arms makes it challenging to identify a subset of alternatives that guarantees the optimal collective outcome. Thus, besides individual actions' reward, learning the causal relations is essential to improve the decision-making strategy. To solve the two-fold learning problem described above, we develop the 'combinatorial semi-bandit framework with causally related rewards', where we model the causal relations by a directed graph in a stationary structural equation model. The nodal observation in the graph signal comprises the corresponding base arm's instantaneous reward and an additional term resulting from the causal influences of other base arms' rewards. The objective is to maximize the long-term average payoff, which is a linear function of the base arms' rewards and depends strongly on the network topology. To achieve this objective, we propose a policy that determines the causal relations by learning the network's topology and simultaneously exploits this knowledge to optimize the decision-making process. We establish a sublinear regret bound for the proposed algorithm. Numerical experiments using synthetic and real-world datasets demonstrate the superior performance of our proposed method compared to several benchmarks.

2.3 · MA · Mar 28, 2022

Distributed Task Management in Fog Computing: A Socially Concave Bandit Game

Xiaotong Cheng, Setareh Maghsudi

Fog computing leverages the task offloading capabilities at the network's edge to improve efficiency and enable swift responses to application demands. However, the design of task allocation strategies in a fog computing network is still challenging because of the heterogeneity of fog nodes and uncertainties in system dynamics. We formulate the distributed task allocation problem as a socially concave game with bandit feedback and show that the game has a unique Nash equilibrium, which is implementable using no-regret learning strategies (regret with sublinear growth). We then develop two no-regret online decision-making strategies. One strategy, namely bandit gradient ascent with momentum, is an online convex optimization algorithm with bandit feedback. The other strategy, Lipschitz bandit with initialization, is an EXP3 multi-armed bandit algorithm. We establish regret bounds for both strategies and analyze their convergence characteristics. Moreover, we compare the proposed strategies with an allocation strategy named learning with linear rewards. Theoretical and numerical analyses show the superior performance of the proposed strategies for efficient task allocation compared to the state-of-the-art methods.

5.8 · AI · Jul 20, 2024 · Code

Unveiling the Decision-Making Process in Reinforcement Learning with Genetic Programming

Manuel Eberhardinger, Florian Rupp, Johannes Maucher et al.

Despite tremendous progress, machine learning and deep learning still suffer from incomprehensible predictions. Incomprehensibility, however, is not an option for the use of (deep) reinforcement learning in the real world, as unpredictable actions can seriously harm the involved individuals. In this work, we propose a genetic programming framework to generate explanations for the decision-making process of already trained agents by imitating them with programs. Programs are interpretable and can be executed to generate explanations of why the agent chooses a particular action. Furthermore, we conduct an ablation study that investigates how extending the domain-specific language by using library learning alters the performance of the method. We compare our results with the previous state of the art for this problem and show that we are comparable in performance but require far fewer hardware resources and much less computation time.

3.8 · LG · Jul 26, 2023

Piecewise-Stationary Combinatorial Semi-Bandit with Causally Related Rewards

Behzad Nourani-Koliji, Steven Bilaj, Amir Rezaei Balef et al.

We study the piecewise-stationary combinatorial semi-bandit problem with causally related rewards. In our nonstationary environment, variations in the base arms' distributions, causal relationships between rewards, or both, change the reward generation process. In such an environment, an optimal decision-maker must follow both sources of change and adapt accordingly. The problem becomes aggravated in the combinatorial semi-bandit setting, where the decision-maker only observes the outcome of the selected bundle of arms. The core of our proposed policy is the Upper Confidence Bound (UCB) algorithm. We assume the agent relies on an adaptive approach to overcome the challenge. More specifically, it employs a change-point detector based on the Generalized Likelihood Ratio (GLR) test. Besides, we introduce the notion of group restart as a new alternative restarting strategy in the decision-making process in structured environments. Finally, our algorithm integrates a mechanism to trace the variations of the underlying graph structure, which captures the causal relationships between the rewards in the bandit setting. Theoretically, we establish a regret upper bound that reflects the effects of the number of structural and distribution changes on the performance. The outcome of our numerical experiments in real-world scenarios demonstrates the applicability and superior performance of our proposal compared to the state-of-the-art benchmarks.
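
For readers unfamiliar with the UCB idea that this abstract names as the core of its policy, here is a minimal Python sketch of the classic UCB1 index for a plain (non-combinatorial, stationary) bandit. It is a textbook simplification, not the paper's policy: there is no GLR change-point detector, no group restart, and no causal graph, and the function names `ucb1_index` and `select_arm` are hypothetical.

```python
import math
from typing import List


def ucb1_index(mean: float, pulls: int, t: int) -> float:
    """Classic UCB1 index: empirical mean plus the exploration bonus
    sqrt(2 * ln(t) / pulls). Unpulled arms get an infinite index."""
    if pulls == 0:
        return math.inf  # ensures every arm is tried at least once
    return mean + math.sqrt(2.0 * math.log(t) / pulls)


def select_arm(means: List[float], pulls: List[int], t: int) -> int:
    """Return the index of the arm with the largest UCB1 index at round t
    (ties are broken in favor of the lowest arm index)."""
    indices = [ucb1_index(m, n, t) for m, n in zip(means, pulls)]
    return max(range(len(indices)), key=indices.__getitem__)


# Hypothetical state: arm 1 has never been pulled, so it is selected next.
next_arm = select_arm(means=[0.5, 0.0], pulls=[10, 0], t=11)
```

A restart-based method for nonstationary settings would typically reset the `means` and `pulls` statistics whenever a change is detected; how and when to reset is exactly what the detector and restart strategy described in the abstract decide.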

5.3 · LG · Jun 9, 2023

Distributed Consensus Algorithm for Decision-Making in Multi-agent Multi-armed Bandit

Xiaotong Cheng, Setareh Maghsudi

We study a structured multi-agent multi-armed bandit (MAMAB) problem in a dynamic environment. A graph reflects the information-sharing structure among agents, and the arms' reward distributions are piecewise-stationary with several unknown change points. The agents face the identical piecewise-stationary MAB problem. The goal is to develop a decision-making policy for the agents that minimizes the regret, which is the expected total loss of not playing the optimal arm at each time step. Our proposed solution, Restarted Bayesian Online Change Point Detection in Cooperative Upper Confidence Bound Algorithm (RBO-Coop-UCB), involves an efficient multi-agent UCB algorithm as its core enhanced with a Bayesian change point detector. We also develop a simple restart decision cooperation that improves decision-making. Theoretically, we establish that the expected group regret of RBO-Coop-UCB is upper bounded by $\mathcal{O}(KNM\log T + K\sqrt{MT\log T})$, where K is the number of agents, M is the number of arms, and T is the number of time steps. Numerical experiments on synthetic and real-world datasets demonstrate that our proposed method outperforms the state-of-the-art algorithms.
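
The verbal definition of regret in this abstract can be written out as follows. This is one plausible formalization under assumed notation, not necessarily the paper's exact definition: $\mu_{t,m}$ (the mean reward of arm $m$ at time $t$) and $a_{k,t}$ (the arm played by agent $k$ at time $t$) are symbols introduced here for illustration, while $K$, $M$, and $T$ follow the abstract.

$$
R(T) = \mathbb{E}\left[\sum_{t=1}^{T}\sum_{k=1}^{K}\left(\mu_t^{*} - \mu_{t,\,a_{k,t}}\right)\right],
\qquad
\mu_t^{*} = \max_{m \in \{1,\dots,M\}} \mu_{t,m}.
$$

Because the reward distributions are piecewise-stationary, the optimal mean $\mu_t^{*}$ is constant between change points and may jump at each of them.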

4.6 · LG · Nov 14, 2022

Hypothesis Transfer in Bandits by Weighted Models

Steven Bilaj, Sofien Dhouib, Setareh Maghsudi

We consider the problem of contextual multi-armed bandits in the setting of hypothesis transfer learning. That is, we assume having access to a previously learned model on an unobserved set of contexts, and we leverage it in order to accelerate exploration on a new bandit problem. Our transfer strategy is based on a re-weighting scheme for which we show a reduction in the regret over the classic Linear UCB when transfer is desired, while recovering the classic regret rate when the two tasks are unrelated. We further extend this method to an arbitrary number of source models, where the algorithm decides which model is preferred at each time step. Additionally, we discuss an approach where a dynamic convex combination of source models is given in terms of a biased regularization term in the classic LinUCB algorithm. The algorithms and the theoretical analysis of our proposed methods are substantiated by empirical evaluations on simulated and real-world data.

4.6 · LG · Mar 9, 2022

Connecting sufficient conditions for domain adaptation: source-guided uncertainty, relaxed divergences and discrepancy localization

Sofien Dhouib, Setareh Maghsudi

Recent advances in domain adaptation establish that requiring a low risk on the source domain and equal feature marginals degrades the adaptation's performance. At the same time, empirical evidence shows that incorporating an unsupervised target domain term that pushes decision boundaries away from the high-density regions, along with relaxed alignment, improves adaptation. In this paper, we theoretically justify such observations via a new bound on the target risk, and we connect two notions of relaxation for divergence, namely $β$-relaxed divergences and localization. This connection allows us to incorporate the source domain's categorical structure into the relaxation of the considered divergence, provably resulting in a better handling of the label shift case in particular.

3.8 · LG · Jul 18, 2023 · Code

Online Learning with Costly Features in Non-stationary Environments

Saeed Ghoorchian, Evgenii Kortukov, Setareh Maghsudi

Maximizing long-term rewards is the primary goal in sequential decision-making problems. The majority of existing methods assume that side information is freely available, enabling the learning agent to observe all features' states before making a decision. In real-world problems, however, collecting beneficial information is often costly. That implies that, besides individual arms' reward, learning the observations of the features' states is essential to improve the decision-making strategy. The problem is aggravated in a non-stationary environment where reward and cost distributions undergo abrupt changes over time. To address the aforementioned dual learning problem, we extend the contextual bandit setting and allow the agent to observe subsets of features' states. The objective is to maximize the long-term average gain, which is the difference between the accumulated rewards and the paid costs on average. Therefore, the agent faces a trade-off between minimizing the cost of information acquisition and possibly improving the decision-making process using the obtained information. To this end, we develop an algorithm that guarantees a sublinear regret in time. Numerical results demonstrate the superiority of our proposed policy in a real-world scenario.

2.0 · LG · Jul 18, 2023

Non-stationary Delayed Combinatorial Semi-Bandit with Causally Related Rewards

Saeed Ghoorchian, Setareh Maghsudi

Sequential decision-making under uncertainty is often associated with long feedback delays. Such delays degrade the performance of the learning agent in identifying a subset of arms with the optimal collective reward in the long run. This problem becomes significantly challenging in a non-stationary environment with structural dependencies amongst the reward distributions associated with the arms. Therefore, besides adapting to delays and environmental changes, learning the causal relations alleviates the adverse effects of feedback delay on the decision-making process. We formalize the described setting as a non-stationary and delayed combinatorial semi-bandit problem with causally related rewards. We model the causal relations by a directed graph in a stationary structural equation model. The agent maximizes the long-term average payoff, defined as a linear function of the base arms' rewards. We develop a policy that learns the structural dependencies from delayed feedback and utilizes that to optimize the decision-making while adapting to drifts. We prove a regret bound for the performance of the proposed algorithm. Besides, we evaluate our method via numerical analysis using synthetic and real-world datasets to detect the regions that contribute the most to the spread of Covid-19 in Italy.

1.4 · LG · Feb 12

Robust Optimization Approach and Learning Based Hide-and-Seek Game for Resilient Network Design

Mohammad Khosravi, Setareh Maghsudi

We study the design of resilient and reliable communication networks in which a signal can be transferred only up to a limited distance before its quality falls below an acceptable threshold. When excessive signal degradation occurs, regeneration is required through regenerators installed at selected network nodes. In this work, both network links and nodes are subject to uncertainty. The installation costs of regenerators are modeled using a budgeted uncertainty set. In addition, link lengths follow a dynamic budgeted uncertainty set introduced in this paper, where deviations may vary over time. Robust optimization seeks solutions whose performance is guaranteed under all scenarios represented by the underlying uncertainty set. Accordingly, the objective is to identify a minimum-cost subset of nodes for regenerator deployment that ensures full network connectivity, even under the worst possible realizations of uncertainty. To solve the problem, we first formulate it within a robust optimization framework, and then develop scalable solution methods based on column-and-constraint generation, Benders decomposition, and iterative robust optimization. In addition, we formulate a learning-based hide-and-seek game to further analyze the problem structure. The proposed approaches are evaluated against classical static budgeted robust models and deterministic worst-case formulations. Both theoretical analysis and computational results demonstrate the effectiveness and advantages of our methodology.

6.0 · SP · Mar 17

Optimal Radio Resource Management for ISAC Under Imperfect Information: A Resource Economy-Driven Perspective

Luis F. Abanto-Leon, Setareh Maghsudi

This work investigates the radio resource management (RRM) design for downlink integrated sensing and communications (ISAC) systems, jointly optimizing timeslot allocation, beam adaptation, functionality selection, and user-target pairing, with the goal of economizing resource consumption under imperfect information. Timeslot allocation assigns a number of discrete channel uses to targets and users, while beam adaptation selects transmit and receive beams with suitable directions, power levels, and beamwidths. Functionality selection determines whether each timeslot is used for sensing, communication, or their simultaneous operation, while user-target pairing specifies which users and targets are jointly served within the same timeslot. To ensure reliable operation, information imperfections arising from motion, quantization, feedback delays, and hardware limitations are considered. Resource economization is achieved by minimizing energy and time consumption through a multi-objective function, with strict prioritization of time savings. The resulting RRM problem is formulated as a semi-infinite, nonconvex mixed-integer nonlinear program (MINLP). Given the lack of generic methods for solving such problems, we propose a tailor-made approach that exploits the underlying structure of the problem to uncover hidden convexities. This enables an exact reformulation as a mixed-integer semidefinite program (MISDP), which can be solved to global optimality. Simulations reveal important interdependencies among the considered RRM components and show that the proposed approach achieves substantial performance improvements over baseline schemes, with gains up to 88%.

6.9 · LG · Apr 6

One Model for All: Multi-Objective Controllable Language Models

Qiang He, Yucheng Yang, Tianyi Zhou et al.

Aligning large language models (LLMs) with human preferences is critical for enhancing LLMs' safety, helpfulness, humor, faithfulness, etc. Current reinforcement learning from human feedback (RLHF) mainly focuses on a fixed reward learned from average human ratings, which may weaken the adaptability and controllability of varying preferences. However, creating personalized LLMs requires aligning LLMs with individual human preferences, which is non-trivial due to the scarce data per user and the diversity of user preferences in multi-objective trade-offs, varying from emphasizing empathy in certain contexts to demanding efficiency and precision in others. Can we train one LLM to produce personalized outputs across different user preferences on the Pareto front? In this paper, we introduce Multi-Objective Control (MOC), which trains a single LLM to directly generate responses in the preference-defined regions of the Pareto front. Our approach introduces multi-objective optimization (MOO) principles into RLHF to train an LLM as a preference-conditioned policy network. We improve the computational efficiency of MOC by applying MOO at the policy level, enabling us to fine-tune a 7B-parameter model on a single A6000 GPU. Extensive experiments demonstrate the advantages of MOC over baselines in three aspects: (i) controllability of LLM outputs w.r.t. user preferences on the trade-off among multiple rewards; (ii) quality and diversity of LLM outputs, measured by the hyper-volume of multiple solutions achieved; and (iii) generalization to unseen preferences. These results highlight MOC's potential for real-world applications requiring scalable and customizable LLMs.

10.4 · LG · Apr 19, 2024 · Code

Adaptive Regularization of Representation Rank as an Implicit Constraint of Bellman Equation

Qiang He, Tianyi Zhou, Meng Fang et al.

Representation rank is an important concept for understanding the role of Neural Networks (NNs) in Deep Reinforcement Learning (DRL), which measures the expressive capacity of value networks. Existing studies focus on unboundedly maximizing this rank; nevertheless, that approach would introduce overly complex models in the learning, thus undermining performance. Hence, fine-tuning representation rank presents a challenging and crucial optimization problem. To address this issue, we find a guiding principle for adaptive control of the representation rank. We employ the Bellman equation as a theoretical foundation and derive an upper bound on the cosine similarity of the value network's representations of consecutive state-action pairs. We then leverage this upper bound to propose a novel regularizer, namely BEllman Equation-based automatic rank Regularizer (BEER). This regularizer adaptively regularizes the representation rank, thus improving the DRL agent's performance. We first validate the effectiveness of automatic control of rank on illustrative experiments. Then, we scale up BEER to complex continuous control tasks by combining it with the deterministic policy gradient method. Among 12 challenging DeepMind control tasks, BEER outperforms the baselines by a large margin. Besides, BEER demonstrates significant advantages in Q-value approximation. Our code is available at https://github.com/sweetice/BEER-ICLR2024.
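
The quantity that this abstract bounds, the cosine similarity between the representations of two consecutive state-action pairs, can be computed as in the short Python sketch below. Only the similarity itself is shown; the upper bound derived from the Bellman equation and the BEER regularizer are not reproduced here, and the function name and the example vectors are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two representation vectors:
    <u, v> / (||u|| * ||v||)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical representations of (s, a) and (s', a') from a value network.
phi_current = [1.0, 0.0]
phi_next = [0.0, 1.0]
similarity = cosine_similarity(phi_current, phi_next)  # orthogonal vectors
```

Intuitively, representations that stay highly similar across many state-action pairs span few directions, which is why a cap on this similarity relates to the representation rank discussed in the abstract.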

23.3 · LG · Aug 11, 2025

Pareto Multi-Objective Alignment for Language Models

Qiang He, Setareh Maghsudi

Large language models (LLMs) are increasingly deployed in real-world applications that require careful balancing of multiple, often conflicting, objectives, such as informativeness versus conciseness, or helpfulness versus creativity. However, current alignment methods, primarily based on RLHF, optimize LLMs toward a single reward function, resulting in rigid behavior that fails to capture the complexity and diversity of human preferences. This limitation hinders the adaptability of LLMs to practical scenarios, making multi-objective alignment (MOA) a critical yet underexplored area. To bridge this gap, we propose Pareto Multi-Objective Alignment (PAMA), a principled and computationally efficient algorithm designed explicitly for MOA in LLMs. In contrast to computationally prohibitive multi-objective optimization (MOO) methods, PAMA transforms multi-objective RLHF into a convex optimization with a closed-form solution, significantly enhancing scalability. Traditional MOO approaches suffer from prohibitive $O(n^2 d)$ complexity, where $d$ represents the number of model parameters, typically in the billions for LLMs, rendering direct optimization infeasible. PAMA reduces this complexity to $O(n)$, where $n$ is the number of objectives, enabling optimization to be completed within milliseconds. We provide theoretical guarantees that PAMA converges to a Pareto stationary point, where no objective can be improved without degrading at least one other. Extensive experiments across language models ranging from 125M to 7B parameters demonstrate PAMA's robust and effective MOA capabilities, aligning with its theoretical advantages. PAMA provides a highly efficient solution to the MOA problem that was previously considered intractable, offering a practical and theoretically grounded approach to aligning LLMs with diverse human values, paving the way for versatile and adaptable real-world AI deployments.

7.9 · LG · Mar 31, 2024

Meta Learning in Bandits within Shared Affine Subspaces

Steven Bilaj, Sofien Dhouib, Setareh Maghsudi

We study the problem of meta-learning several contextual stochastic bandit tasks by leveraging their concentration around a low-dimensional affine subspace, which we learn via online principal component analysis to reduce the expected regret over the encountered bandits. We propose and theoretically analyze two strategies that solve the problem: one based on the principle of optimism in the face of uncertainty and the other via Thompson sampling. Our framework is generic and includes previously proposed approaches as special cases. Besides, the empirical results show that our methods significantly reduce the regret on several bandit tasks.

4.1 · LG · Dec 5, 2025

Meta-Learning Multi-armed Bandits for Beam Tracking in 5G and 6G Networks

Alexander Mattick, George Yammine, Georgios Kontes et al.

Beamforming-capable antenna arrays with many elements enable higher data rates in next-generation 5G and 6G networks. In current practice, analog beamforming uses a codebook of pre-configured beams, each radiating towards a specific direction, and a beam management function continuously selects optimal beams for moving user equipments (UEs). However, large codebooks and effects caused by reflections or blockages of beams make an optimal beam selection challenging. In contrast to previous work and standardization efforts that opt for supervised learning to train classifiers to predict the next best beam based on previously selected beams, we formulate the problem as a partially observable Markov decision process (POMDP) and model the environment as the codebook itself. At each time step, we select a candidate beam conditioned on the belief state of the unobservable optimal beam and previously probed beams. This frames the beam selection problem as an online search procedure that locates the moving optimal beam. In contrast to previous work, our method handles new or unforeseen trajectories and changes in the physical environment, and outperforms previous work by orders of magnitude.

1.2 · MA · Aug 4, 2025

Emergence of Fair Leaders via Mediators in Multi-Agent Reinforcement Learning

Akshay Dodwadmath, Setareh Maghsudi

Stackelberg games and their resulting equilibria have received increasing attention in the multi-agent reinforcement learning literature. Each stage of a traditional Stackelberg game involves a leader(s) acting first, followed by the followers. In situations where the roles of leader(s) and followers can be interchanged, the designated role can have considerable advantages, for example, in first-mover advantage settings. Then the question arises: Who should be the leader and when? A bias in the leader selection process can lead to unfair outcomes. This problem is aggravated if the agents are self-interested and care only about their goals and rewards. We formally define this leader selection problem and show its relation to fairness in agents' returns. Furthermore, we propose a multi-agent reinforcement learning framework that maximizes fairness by integrating mediators. Mediators have previously been used in the simultaneous action setting with varying levels of control, such as directly performing agents' actions or just recommending them. Our framework integrates mediators in the Stackelberg setting with minimal control (leader selection). We show that the presence of mediators leads to self-interested agents taking fair actions, resulting in higher overall fairness in agents' returns.

1.2 · NI · Jun 22, 2025

Service Placement in Small Cell Networks Using Distributed Best Arm Identification in Linear Bandits

Mariam Yahya, Aydin Sezgin, Setareh Maghsudi

As users in small cell networks increasingly rely on computation-intensive services, cloud-based access often results in high latency. Multi-access edge computing (MEC) mitigates this by bringing computational resources closer to end users, with small base stations (SBSs) serving as edge servers to enable low-latency service delivery. However, limited edge capacity makes it challenging to decide which services to deploy locally versus in the cloud, especially under unknown service demand and dynamic network conditions. To tackle this problem, we model service demand as a linear function of service attributes and formulate the service placement task as a linear bandit problem, where SBSs act as agents and services as arms. The goal is to identify the service that, when placed at the edge, offers the greatest reduction in total user delay compared to cloud deployment. We propose a distributed and adaptive multi-agent best-arm identification (BAI) algorithm under a fixed-confidence setting, where SBSs collaborate to accelerate learning. Simulations show that our algorithm identifies the optimal service with the desired confidence and achieves near-optimal speedup, as the number of learning rounds decreases proportionally with the number of SBSs. We also provide theoretical analysis of the algorithm's sample complexity and communication overhead.

4.1LGJun 16, 2025

Stochastic Multi-Objective Multi-Armed Bandits: Regret Definition and Algorithm

Mansoor Davoodi, Setareh Maghsudi

Multi-armed bandit (MAB) problems are widely applied to online optimization tasks that require balancing exploration and exploitation. In practical scenarios, these tasks often involve multiple conflicting objectives, giving rise to multi-objective multi-armed bandits (MO-MAB). Existing MO-MAB approaches predominantly rely on the Pareto regret metric introduced in \cite{drugan2013designing}. However, this metric has notable limitations, particularly in accounting for all Pareto-optimal arms simultaneously. To address these challenges, we propose a novel and comprehensive regret metric that ensures balanced performance across conflicting objectives. Additionally, we introduce the concept of Efficient Pareto-Optimal arms, which are specifically designed for online optimization. Based on our new metric, we develop a two-phase MO-MAB algorithm that achieves sublinear regret for both Pareto-optimal and efficient Pareto-optimal arms.
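As a rough illustration of what "Pareto-optimal arms" means in the abstract above, the sketch below applies the textbook Pareto-dominance test to a set of mean reward vectors. It is not the paper's regret metric, its Efficient Pareto-Optimal notion, or its two-phase algorithm. The function names and the example means are hypothetical, and larger values are assumed to be better in every objective.

```python
from typing import List, Sequence


def dominates(u: Sequence[float], v: Sequence[float]) -> bool:
    """Return True if u is at least as good as v in every objective
    and strictly better in at least one (larger is assumed better)."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))


def pareto_optimal_arms(means: List[Sequence[float]]) -> List[int]:
    """Return the indices of arms whose mean vector is not dominated by any other arm."""
    return [
        i
        for i, u in enumerate(means)
        if not any(dominates(v, u) for j, v in enumerate(means) if j != i)
    ]


# Hypothetical two-objective example: arm 3 is dominated by arm 1.
example_means = [(0.9, 0.1), (0.5, 0.5), (0.1, 0.9), (0.4, 0.4)]
front = pareto_optimal_arms(example_means)
print(front)
```

Here arms 0, 1, and 2 trade off the two objectives against each other, so none dominates another, while arm 3 is worse than arm 1 in both objectives.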

7.1LGMar 6, 2025

Quantum-Inspired Reinforcement Learning in the Presence of Epistemic Ambivalence

Alireza Habibi, Saeed Ghoorchian, Setareh Maghsudi

The complexity of online decision-making under uncertainty stems from the requirement of finding a balance between exploiting known strategies and exploring new possibilities. Naturally, the uncertainty type plays a crucial role in developing decision-making strategies that manage complexity effectively. In this paper, we focus on a specific form of uncertainty known as epistemic ambivalence (EA), which emerges from conflicting pieces of evidence or contradictory experiences. It creates a delicate interplay between uncertainty and confidence, distinguishing it from epistemic uncertainty that typically diminishes with new information. Indeed, ambivalence can persist even after additional knowledge is acquired. To address this phenomenon, we propose a novel framework, called the epistemically ambivalent Markov decision process (EA-MDP), aiming to understand and control EA in decision-making processes. This framework incorporates the concept of a quantum state from the quantum mechanics formalism, and its core is to assess the probability and reward of every possible outcome. We calculate the reward function using quantum measurement techniques and prove the existence of an optimal policy and an optimal value function in the EA-MDP framework. We also propose the EA-epsilon-greedy Q-learning algorithm. To evaluate the impact of EA on decision-making and the expedience of our framework, we study two distinct experimental setups, namely the two-state problem and the lattice problem. Our results show that using our methods, the agent converges to the optimal policy in the presence of EA.

5.8AIDec 5, 2024

From Code to Play: Benchmarking Program Search for Games Using Large Language Models

Manuel Eberhardinger, James Goodman, Alexander Dockhorn et al.

Large language models (LLMs) have shown impressive capabilities in generating program code, opening exciting opportunities for applying program synthesis to games. In this work, we explore the potential of LLMs to directly synthesize usable code for a wide range of gaming applications, focusing on two programming languages, Python and Java. We use an evolutionary hill-climbing algorithm, where the mutations and seeds of the initial programs are controlled by LLMs. For Python, the framework covers various game-related tasks, including five miniature versions of Atari games, ten levels of Baba is You, an environment inspired by Asteroids, and a maze generation task. For Java, the framework contains 12 games from the TAG tabletop games framework. Across 29 tasks, we evaluated 12 language models for Python and 8 for Java. Our findings suggest that the performance of LLMs depends more on the task than on model size. While larger models generate more executable programs, these do not always result in higher-quality solutions but are much more expensive. No model has a clear advantage, although on any specific task, one model may be better. Trying many models on a problem and using the best results across them is more reliable than using just one.
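The search loop described above can be sketched as a plain hill climber in which the mutation step is a pluggable function. In the paper that step is driven by an LLM; here `mutate` and `score` are hypothetical stand-ins (a toy integer "program" and a toy fitness), so this only shows the accept-if-better control flow, not the authors' framework.

```python
from typing import Callable, TypeVar

P = TypeVar("P")  # type of a candidate program (a string of source code in practice)


def hill_climb(
    seed_program: P,
    mutate: Callable[[P], P],      # assumption: would wrap an LLM call in practice
    score: Callable[[P], float],   # assumption: would run the program on the game task
    iterations: int,
) -> P:
    """Keep the best candidate seen so far; accept a mutation only if it scores higher."""
    best, best_score = seed_program, score(seed_program)
    for _ in range(iterations):
        candidate = mutate(best)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best


# Toy stand-ins: the "program" is an integer and the optimum is at 5.
result = hill_climb(
    seed_program=0,
    mutate=lambda x: x + 1,
    score=lambda x: -(x - 5) ** 2,
    iterations=10,
)
print(result)
```

Because only strictly better candidates are accepted, the loop can stall at a local optimum, which is one reason trying several models or seeds on the same task can help.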

2.3DCJun 24, 2024

Decentralized Task Offloading and Load-Balancing for Mobile Edge Computing in Dense Networks

Mariam Yahya, Alexander Conzelmann, Setareh Maghsudi

We study the problem of decentralized task offloading and load-balancing in a dense network with numerous devices and a set of edge servers. Solving this problem optimally is complicated due to the unknown network information and random task sizes. The shared network resources also influence the users' decisions and resource distribution. Our solution combines the mean field multi-agent multi-armed bandit (MAB) game with a load-balancing technique that adjusts the servers' rewards to achieve a target population profile despite the distributed user decision-making. Numerical results demonstrate the efficacy of our approach and the convergence to the target load distribution.

2.6LGMay 19, 2024

Budgeted Recommendation with Delayed Feedback

Kweiguu Liu, Setareh Maghsudi

In a conventional contextual multi-armed bandit problem, the feedback (or reward) is immediately observable after an action. Nevertheless, delayed feedback arises in numerous real-life situations and is particularly crucial in time-sensitive applications. The exploration-exploitation dilemma becomes particularly challenging under such conditions, as it couples with the interplay between delays and limited resources. Besides, a limited budget often aggravates the problem by restricting the exploration potential. A motivating example is the distribution of medical supplies at the early stage of COVID-19. The delayed feedback of testing results, and the resulting lack of information for learning, degraded the efficiency of resource allocation. Motivated by such applications, we study the effect of delayed feedback on constrained contextual bandits. We develop a decision-making policy, delay-oriented resource allocation with learning (DORAL), to optimize the resource expenditure in a contextual multi-armed bandit problem with arm-dependent delayed feedback.

2.0LGMay 30, 2023

Cooperative Thresholded Lasso for Sparse Linear Bandit

Haniyeh Barghi, Xiaotong Cheng, Setareh Maghsudi

We present a novel approach to address the multi-agent sparse contextual linear bandit problem, in which the feature vectors have a high dimension $d$ whereas the reward function depends on only a limited set of features - precisely $s_0 \ll d$. Furthermore, the learning proceeds under information-sharing constraints. The proposed method employs Lasso regression for dimension reduction, allowing each agent to independently estimate an approximate set of main dimensions and share that information with others depending on the network's structure. The information is then aggregated through a specific process and shared with all agents. Each agent then solves the problem with ridge regression, focusing solely on the extracted dimensions. We present algorithms for both a star-shaped network and a peer-to-peer network. The approaches effectively reduce communication costs while ensuring minimal cumulative regret per agent. Theoretically, we show that our proposed methods have a regret bound of order $\mathcal{O}(s_0 \log d + s_0 \sqrt{T})$ with high probability, where $T$ is the time horizon. To the best of our knowledge, it is the first algorithm that tackles row-wise distributed data in sparse linear bandits, achieving performance comparable to the state-of-the-art single and multi-agent methods. Besides, it is widely applicable to high-dimensional multi-agent problems where efficient feature extraction is critical for minimizing regret. To validate the effectiveness of our approach, we present experimental results on both synthetic and real-world datasets.

1.2SPMay 8, 2023

Deep Learning and Image Super-Resolution-Guided Beam and Power Allocation for mmWave Networks

Yuwen Cao, Tomoaki Ohtsuki, Setareh Maghsudi et al.

In this paper, we develop a deep learning (DL)-guided hybrid beam and power allocation approach for multiuser millimeter-wave (mmWave) networks, which facilitates swift beamforming at the base station (BS). The following persisting challenges motivated our research: (i) User and vehicular mobility, as well as redundant beam-reselections in mmWave networks, degrade the efficiency; (ii) Due to the large beamforming dimension at the BS, the beamforming weights predicted by the cutting-edge DL-based methods often do not suit the channel distributions; (iii) Co-located user devices may cause a severe beam conflict, thus deteriorating system performance. To address the aforementioned challenges, we exploit the synergy of supervised learning and super-resolution technology to enable low-overhead beam- and power allocation. In the first step, we propose a method for beam-quality prediction. It is based on deep learning and explores the relationship between high- and low-resolution beam images (energy). Afterward, we develop a DL-based allocation approach, which enables high-accuracy beam and power allocation with only a portion of the available time-sequential low-resolution images. Theoretical and numerical results verify the effectiveness of our proposed

3.3NIAug 14, 2020

Computation Offloading in Heterogeneous Vehicular Edge Networks: On-line and Off-policy Bandit Solutions

Arash Bozorgchenani, Setareh Maghsudi, Daniele Tarchi et al.

With the rapid advancement of Intelligent Transportation Systems (ITS) and vehicular communications, Vehicular Edge Computing (VEC) is emerging as a promising technology to support low-latency ITS applications and services. In this paper, we consider the computation offloading problem from mobile vehicles/users in a heterogeneous VEC scenario, and focus on the network- and base station selection problems, where different networks have different traffic loads. In a fast-varying vehicular environment, computation offloading experience of users is strongly affected by the latency due to the congestion at the edge computing servers co-located with the base stations. However, as a result of the non-stationary property of such an environment and also information shortage, predicting this congestion is an involved task. To address this challenge, we propose an on-line learning algorithm and an off-policy learning algorithm based on multi-armed bandit theory. To dynamically select the least congested network in a piece-wise stationary environment, these algorithms predict the latency that the offloaded tasks experience using the offloading history. In addition, to minimize the task loss due to the mobility of the vehicles, we develop a method for base station selection. Moreover, we propose a relaying mechanism for the selected network, which operates based on the sojourn time of the vehicles. Through intensive numerical analysis, we demonstrate that the proposed learning-based solutions adapt to the traffic changes of the network by selecting the least congested network, thereby reducing the latency of offloaded tasks. Moreover, we demonstrate that the proposed joint base station selection and the relaying mechanism minimize the task loss in a vehicular environment.

6.6SPJun 26, 2020Code

Distributed Uplink Beamforming in Cell-Free Networks Using Deep Reinforcement Learning

Firas Fredj, Yasser Al-Eryani, Setareh Maghsudi et al.

The emergence of new wireless technologies together with the requirement of massive connectivity results in several technical issues such as excessive interference, high computational demand for signal processing, and lengthy processing delays. In this work, we propose several beamforming techniques for an uplink cell-free network with centralized, semi-distributed, and fully distributed processing, all based on deep reinforcement learning (DRL). First, we propose a fully centralized beamforming method that uses the deep deterministic policy gradient algorithm (DDPG) with continuous space. We then enhance this method by enabling distributed experience at access points (APs). Specifically, we develop a beamforming scheme that uses the distributed distributional deterministic policy gradients algorithm (D4PG) with the APs representing the distributed agents. Finally, to decrease the computational complexity, we propose a fully distributed beamforming scheme that divides the beamforming computations among APs. The results show that the D4PG scheme with distributed experience achieves the best performance irrespective of the network size. Furthermore, the proposed distributed beamforming technique performs better than the DDPG algorithm with centralized learning only for small-scale networks. The performance superiority of the DDPG model becomes more evident as the number of APs and/or users increases. Moreover, during the operation stage, all DRL models demonstrate a significantly shorter processing time than that of the conventional gradient descent (GD) solution.

4.3SPApr 13, 2020

A Non-Stationary Bandit-Learning Approach to Energy-Efficient Femto-Caching with Rateless-Coded Transmission

Setareh Maghsudi, Mihaela van der Schaar

The ever-increasing demand for media streaming together with limited backhaul capacity renders developing efficient file-delivery methods imperative. One such method is femto-caching, which, despite its great potential, imposes several challenges such as efficient resource management. We study a resource allocation problem for joint caching and transmission in small cell networks, where the system operates in two consecutive phases: (i) cache placement, and (ii) joint file- and transmit power selection followed by broadcasting. We define the utility of every small base station in terms of the number of successful reconstructions per unit of transmission power. We then formulate the problem as selecting a file from the cache together with a transmission power level for every broadcast round so that the accumulated utility over the horizon is maximized. The former problem boils down to a stochastic knapsack problem, and we cast the latter as a multi-armed bandit problem. We develop a solution to each problem and provide theoretical and numerical evaluations. In contrast to the state-of-the-art research, the proposed approach is especially suitable for networks with time-variant statistical properties. Moreover, it is applicable and operates well even when no initial information about the statistical characteristics of the random parameters such as file popularity and channel quality is available.

2.3NIFeb 4, 2020

EdgeDASH: Exploiting Network-Assisted Adaptive Video Streaming for Edge Caching

Suzan Bayhan, Setareh Maghsudi, Anatolij Zubow

While edge video caching has great potential to decrease the core network traffic as well as the users' experienced latency, it is often challenging to exploit the caches in current client-driven video streaming solutions for two key reasons. First, even clients interested in the same content might request different quality levels, as video content is encoded into multiple qualities to match a wide range of network conditions and device capabilities. Second, the clients, who select the quality of the next chunk to request, are unaware of the cached content at the network edge. Hence, it becomes imperative to develop network-side solutions to exploit caching. This can also mitigate some performance issues, in particular for scenarios in which multiple video clients compete for some bottleneck capacity. In this paper, we propose a network-side control logic running at a WiFi AP to facilitate the use of cached video content. In particular, an AP can assign a client station a different video quality than it requested if the alternative quality provides a better utility. We formulate the quality assignment problem as an optimization problem and develop several heuristics with polynomial complexity. Compared to the baseline where the clients determine the quality adaptation, our proposals, referred to as EdgeDASH, offer higher video quality, higher cache hits, and a lower stalling ratio, which are essential for user satisfaction. Our simulations show that EdgeDASH facilitates significant cache hits and decreases the buffer stalls only by changing the client's request by one quality level. Moreover, from our analysis, we conclude that the network assistance provides significant performance improvement, especially when clients with identical interests compete for a bottleneck link's capacity.

3.3NIApr 30, 2016

Distributed Cell Association for Energy Harvesting IoT Devices in Dense Small Cell Networks: A Mean-Field Multi-Armed Bandit Approach

Setareh Maghsudi, Ekram Hossain

The emerging Internet of Things (IoT)-driven ultra-dense small cell networks (UD-SCNs) will need to combat a variety of challenges. On the one hand, a massive number of devices sharing the limited wireless resources will render centralized control mechanisms infeasible due to the excessive cost of information acquisition and computations. On the other hand, to reduce energy consumption from the fixed power grid and/or battery, many IoT devices may need to depend on the energy harvested from the ambient environment (e.g., from RF transmissions, environmental sources). However, due to the opportunistic nature of energy harvesting, this will introduce uncertainty in the network operation. In this article, we study the distributed cell association problem for energy harvesting IoT devices in UD-SCNs. After reviewing the state-of-the-art research on the cell association problem in small cell networks, we outline the major challenges for distributed cell association in IoT-driven UD-SCNs, where the IoT devices will need to perform cell association in a distributed manner in the presence of uncertainty (e.g., limited knowledge of the channel/network) and limited computational capabilities. To this end, we propose an approach based on mean-field multi-armed bandit games to solve the uplink cell association problem for energy harvesting IoT devices in a UD-SCN. This approach is particularly suitable for analyzing large multi-agent systems under uncertainty and lack of information. We provide some theoretical results as well as preliminary performance evaluation results for the proposed approach.

13.2LGOct 2, 2015

Multi-armed Bandits with Application to 5G Small Cells

Setareh Maghsudi, Ekram Hossain

Due to the pervasive demand for mobile services, next-generation wireless networks are expected to deliver high data rates while wireless resources become more and more scarce. This requires next-generation wireless networks to move towards new networking paradigms that can efficiently support resource-demanding applications such as personalized mobile services. Examples of such paradigms foreseen for the emerging fifth generation (5G) cellular networks include very densely deployed small cells and device-to-device communications. For 5G networks, it will be imperative to search for spectrum- and energy-efficient solutions to the resource allocation problems that i) are amenable to distributed implementation, ii) are capable of dealing with uncertainty and lack of information, and iii) can cope with users' selfishness. The core objective of this article is to investigate and establish the potential of the multi-armed bandit (MAB) framework to address this challenge. In particular, we provide a brief tutorial on bandit problems, including different variations and solution approaches. Furthermore, we discuss recent applications as well as future research directions. In addition, we provide a detailed example of using an MAB model for energy-efficient small cell planning in 5G networks.
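For readers new to bandit problems, the following is a minimal sketch of UCB1, one standard solution approach for stochastic MABs. It is a generic textbook algorithm and not necessarily one covered in the article; the Bernoulli arms, their means, the horizon, and the function name `ucb1` are all assumptions chosen for illustration.

```python
import math
import random
from typing import List


def ucb1(arm_means: List[float], horizon: int, seed: int = 0) -> List[int]:
    """Run UCB1 on simulated Bernoulli arms and return how often each arm was pulled."""
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k      # number of pulls per arm
    sums = [0.0] * k      # total observed reward per arm
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1   # pull every arm once to initialize its estimate
        else:
            # Choose the arm with the largest empirical mean plus exploration bonus.
            arm = max(
                range(k),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts


# Hypothetical arms, e.g. three candidate resource choices with unknown success rates.
pull_counts = ucb1([0.2, 0.5, 0.8], horizon=5000)
print(pull_counts)
```

The exploration bonus shrinks as an arm is pulled more often, so the learner gradually concentrates its pulls on the arm with the highest mean while still occasionally re-checking the others.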