Sharayu Moharir

h-index11

16papers

65citations

Novelty57%

AI Score54

Ranked #9,799 of 194,257 authors (top 5%)#2,559 in LG (top 6%)

16 Papers

6.4AIMay 6Code

On-line Learning in Tree MDPs by Treating Policies as Bandit Arms

Anvay Shah, Ramsundar Anandanarayanan, Sharayu Moharir et al.

A Tree Markov Decision Problem (T-MDP) is a finite-horizon MDP with a starting state $s_{1}$, in which every state is reachable from $s_{1}$ through exactly one state-action trajectory. T-MDPs arise naturally as abstractions of decision making in sequential games with perfect recall, against stationary opponents. We consider the problem of on-line learning in T-MDPs, both in the PAC and the regret-minimisation regimes. We show that well-known bandit algorithms -- \textsc{Lucb} and \textsc{Ucb} -- can be applied on T-MDPs by treating each policy as an arm. The apparent technical challenge in this approach is that the number of policies is exponential in the number of states. Our main innovation is in the design of confidence bounds based on data shared by the policies, so that the bandit algorithms can yet be implemented with polynomial memory and per-step computation. We obtain instance-dependent upper bounds on sample complexity and regret that sum a ``gap term'' from every terminal state, rather than every policy. Empirically, our algorithms consistently outperform available alternatives on a suite of hidden-information games.

7.8LGJul 9, 2022

Multi-Model Federated Learning with Provable Guarantees

Neelkamal Bhuyan, Sharayu Moharir, Gauri Joshi

Federated Learning (FL) is a variant of distributed learning where edge devices collaborate to learn a model without sharing their data with the central server or each other. We refer to the process of training multiple independent models simultaneously in a federated setting using a common pool of clients as multi-model FL. In this work, we propose two variants of the popular FedAvg algorithm for multi-model FL, with provable convergence guarantees. We further show that for the same amount of computation, multi-model FL can have better performance than training each model separately. We supplement our theoretical results with experiments in strongly convex, convex, and non-convex settings.

2.3DCMar 13, 2023

On the Regret of Online Edge Service Hosting

R Sri Prakash, Nikhil Karamchandani, Sharayu Moharir

We consider the problem of service hosting where a service provider can dynamically rent edge resources via short term contracts to ensure better quality of service to its customers. The service can also be partially hosted at the edge, in which case, customers' requests can be partially served at the edge. The total cost incurred by the system is modeled as a combination of the rent cost, the service cost incurred due to latency in serving customers, and the fetch cost incurred as a result of the bandwidth used to fetch the code/databases of the service from the cloud servers to host the service at the edge. In this paper, we compare multiple hosting policies with regret as a metric, defined as the difference in the cost incurred by the policy and the optimal policy over some time horizon $T$. In particular we consider the Retro Renting (RR) and Follow The Perturbed Leader (FTPL) policies proposed in the literature and provide performance guarantees on the regret of these policies. We show that under i.i.d stochastic arrivals, RR policy has linear regret while FTPL policy has constant regret. Next, we propose a variant of FTPL, namely Wait then FTPL (W-FTPL), which also has constant regret while demonstrating much better dependence on the fetch cost. We also show that under adversarial arrivals, RR policy has linear regret while both FTPL and W-FTPL have regret $\mathrm{O}(\sqrt{T})$ which is order-optimal.

4.1LGNov 14, 2025

Cascading Bandits With Feedback

R Sri Prakash, Nikhil Karamchandani, Sharayu Moharir

Motivated by the challenges of edge inference, we study a variant of the cascade bandit model in which each arm corresponds to an inference model with an associated accuracy and error probability. We analyse four decision-making policies-Explore-then-Commit, Action Elimination, Lower Confidence Bound (LCB), and Thompson Sampling-and provide sharp theoretical regret guarantees for each. Unlike in classical bandit settings, Explore-then-Commit and Action Elimination incur suboptimal regret because they commit to a fixed ordering after the exploration phase, limiting their ability to adapt. In contrast, LCB and Thompson Sampling continuously update their decisions based on observed feedback, achieving constant O(1) regret. Simulations corroborate these theoretical findings, highlighting the crucial role of adaptivity for efficient edge inference under uncertainty.

1.2SIFeb 13, 2020

Influencing Opinions of Heterogeneous Populations over Finite Time Horizons

Arunabh Saxena, Bhumesh Kumar, Anmol Gupta et al.

In this work, we focus on strategies to influence the opinion dynamics of a well-connected society. We propose a generalization of the popular voter model. This variant of the voter model can capture a wide range of individuals including strong-willed individuals whose opinion evolution is independent of their neighbors as well as conformist/rebel individuals who tend to adopt the opinion of the majority/minority. Motivated by political campaigns which aim to influence opinion dynamics by the end of a fixed deadline, we focus on influencing strategies for finite time horizons. We characterize the nature of optimal influencing strategies as a function of the nature of individuals forming the society. Using this, we show that for a society consisting of predominantly strong-willed/rebel individuals, the optimal strategy is to influence towards the end of the finite time horizon, whereas, for a society predominantly consisting of conformist individuals who try to adopt the opinion of the majority, it could be optimal to influence in the initial phase of the finite time horizon.

1.4LGMar 4

Fixed-Budget Constrained Best Arm Identification in Grouped Bandits

Raunak Mukherjee, Sharayu Moharir

We study fixed budget constrained best-arm identification in grouped bandits, where each arm consists of multiple independent attributes with stochastic rewards. An arm is considered feasible only if all its attributes' means are above a given threshold. The aim is to find the feasible arm with the largest overall mean. We first derive a lower bound on the error probability for any algorithm on this setting. We then propose Feasibility Constrained Successive Rejects (FCSR), a novel algorithm that identifies the best arm while ensuring feasibility. We show it attains optimal dependence on problem parameters up to constant factors in the exponent. Empirically, FCSR outperforms natural baselines while preserving feasibility guarantees.

4.1LGSep 26, 2025

Observation-Free Attacks on Online Learning to Rank

Sameep Chattopadhyay, Nikhil Karamchandani, Sharayu Moharir

Online learning to rank (OLTR) plays a critical role in information retrieval and machine learning systems, with a wide range of applications in search engines and content recommenders. However, despite their extensive adoption, the susceptibility of OLTR algorithms to coordinated adversarial attacks remains poorly understood. In this work, we present a novel framework for attacking some of the widely used OLTR algorithms. Our framework is designed to promote a set of target items so that they appear in the list of top-K recommendations for T - o(T) rounds, while simultaneously inducing linear regret in the learning algorithm. We propose two novel attack strategies: CascadeOFA for CascadeUCB1 and PBMOFA for PBM-UCB . We provide theoretical guarantees showing that both strategies require only O(log T) manipulations to succeed. Additionally, we supplement our theoretical analysis with empirical results on real-world data.

4.1LGSep 19, 2025

Inference Offloading for Cost-Sensitive Binary Classification at the Edge

Vishnu Narayanan Moothedath, Umang Agarwal, Umeshraja N et al.

We focus on a binary classification problem in an edge intelligence system where false negatives are more costly than false positives. The system has a compact, locally deployed model, which is supplemented by a larger, remote model, which is accessible via the network by incurring an offloading cost. For each sample, our system first uses the locally deployed model for inference. Based on the output of the local model, the sample may be offloaded to the remote model. This work aims to understand the fundamental trade-off between classification accuracy and the offloading costs within such a hierarchical inference (HI) system. To optimise this system, we propose an online learning framework that continuously adapts a pair of thresholds on the local model's confidence scores. These thresholds determine the prediction of the local model and whether a sample is classified locally or offloaded to the remote model. We present a closed-form solution for the setting where the local model is calibrated. For the more general case of uncalibrated models, we introduce H2T2, an online two-threshold hierarchical inference policy, and prove it achieves sublinear regret. H2T2 is model-agnostic, requires no training, and learns during the inference phase using limited feedback. Simulations on real-world datasets show that H2T2 consistently outperforms naive and single-threshold HI policies, sometimes even surpassing offline optima. The policy also demonstrates robustness to distribution shifts and adapts effectively to mismatched classifiers.

4.1LGFeb 11, 2025

Fixed-Confidence Best Arm Identification with Decreasing Variance

Tamojeet Roychowdhury, Kota Srinivas Reddy, Krishna P Jagannathan et al.

We focus on the problem of best-arm identification in a stochastic multi-arm bandit with temporally decreasing variances for the arms' rewards. We model arm rewards as Gaussian random variables with fixed means and variances that decrease with time. The cost incurred by the learner is modeled as a weighted sum of the time needed by the learner to identify the best arm, and the number of samples of arms collected by the learner before termination. Under this cost function, there is an incentive for the learner to not sample arms in all rounds, especially in the initial rounds. On the other hand, not sampling increases the termination time of the learner, which also increases cost. This trade-off necessitates new sampling strategies. We propose two policies. The first policy has an initial wait period with no sampling followed by continuous sampling. The second policy samples periodically and uses a weighted average of the rewards observed to identify the best arm. We provide analytical guarantees on the performance of both policies and supplement our theoretical results with simulations which show that our polices outperform the state-of-the-art policies for the classical best arm identification problem.

6.4LGDec 11, 2024

Constrained Best Arm Identification in Grouped Bandits

Sahil Dharod, Malyala Preethi Sravani, Sakshi Heda et al.

We study a grouped bandit setting where each arm comprises multiple independent sub-arms referred to as attributes. Each attribute of each arm has an independent stochastic reward. We impose the constraint that for an arm to be deemed feasible, the mean reward of all its attributes should exceed a specified threshold. The goal is to find the arm with the highest mean reward averaged across attributes among the set of feasible arms in the fixed confidence setting. We first characterize a fundamental limit on the performance of any policy. Following this, we propose a near-optimal confidence interval-based policy to solve this problem and provide analytical guarantees for the policy. We compare the performance of the proposed policy with that of two suitably modified versions of action elimination via simulations.

2.6LGFeb 29, 2024

Influencing Bandits: Arm Selection for Preference Shaping

Viraj Nadkarni, D. Manjunath, Sharayu Moharir

We consider a non stationary multi-armed bandit in which the population preferences are positively and negatively reinforced by the observed rewards. The objective of the algorithm is to shape the population preferences to maximize the fraction of the population favouring a predetermined arm. For the case of binary opinions, two types of opinion dynamics are considered -- decreasing elasticity (modeled as a Polya urn with increasing number of balls) and constant elasticity (using the voter model). For the first case, we describe an Explore-then-commit policy and a Thompson sampling policy and analyse the regret for each of these policies. We then show that these algorithms and their analyses carry over to the constant elasticity case. We also describe a Thompson sampling based algorithm for the case when more than two types of opinions are present. Finally, we discuss the case where presence of multiple recommendation systems gives rise to a trade-off between their popularity and opinion shaping objectives.

7.8LGJan 7, 2022

Multi-Model Federated Learning

Neelkamal Bhuyan, Sharayu Moharir

Federated learning is a form of distributed learning with the key challenge being the non-identically distributed nature of the data in the participating clients. In this paper, we extend federated learning to the setting where multiple unrelated models are trained simultaneously. Specifically, every client is able to train any one of M models at a time and the server maintains a model for each of the M models which is typically a suitably averaged version of the model computed by the clients. We propose multiple policies for assigning learning tasks to clients over time. In the first policy, we extend the widely studied FedAvg to multi-model learning by allotting models to clients in an i.i.d. stochastic manner. In addition, we propose two new policies for client selection in a multi-model federated setting which make decisions based on current local losses for each client-model pair. We compare the performance of the policies on tasks involving synthetic and real-world data and characterize the performance of the proposed policies. The key take-away from our work is that the proposed multi-model policies perform better or at least as good as single model training using FedAvg.

2.3DSNov 3, 2020

Greedy k-Center from Noisy Distance Samples

Neharika Jali, Nikhil Karamchandani, Sharayu Moharir

We study a variant of the canonical k-center problem over a set of vertices in a metric space, where the underlying distances are apriori unknown. Instead, we can query an oracle which provides noisy/incomplete estimates of the distance between any pair of vertices. We consider two oracle models: Dimension Sampling where each query to the oracle returns the distance between a pair of points in one dimension; and Noisy Distance Sampling where the oracle returns the true distance corrupted by noise. We propose active algorithms, based on ideas such as UCB, Thompson Sampling and Track-and-Stop developed in the closely related Multi-Armed Bandit problem, which adaptively decide which queries to send to the oracle and are able to solve the k-center problem within an approximation ratio of two with high probability. We analytically characterize instance-dependent query complexity of our algorithms and also demonstrate significant improvements over naive implementations via numerical evaluations on two real-world datasets (Tiny ImageNet and UT Zappos50K).

4.7OCJul 8, 2020

Dynamic social learning under graph constraints

Konstantin Avrachenkov, Vivek S. Borkar, Sharayu Moharir et al.

We introduce a model of graph-constrained dynamic choice with reinforcement modeled by positively $α$-homogeneous rewards. We show that its empirical process, which can be written as a stochastic approximation recursion with Markov noise, has the same probability law as a certain vertex reinforced random walk. We use this equivalence to show that for $α> 0$, the asymptotic outcome concentrates around the optimum in a certain limiting sense when `annealed' by letting $α\uparrow\infty$ slowly.

1.2MLNov 14, 2019

Unreliable Multi-Armed Bandits: A Novel Approach to Recommendation Systems

Aditya Narayan Ravi, Pranav Poduval, Sharayu Moharir

We use a novel modification of Multi-Armed Bandits to create a new model for recommendation systems. We model the recommendation system as a bandit seeking to maximize reward by pulling on arms with unknown rewards. The catch however is that this bandit can only access these arms through an unreliable intermediate that has some level of autonomy while choosing its arms. For example, in a streaming website the user has a lot of autonomy while choosing content they want to watch. The streaming sites can use targeted advertising as a means to bias opinions of these users. Here the streaming site is the bandit aiming to maximize reward and the user is the unreliable intermediate. We model the intermediate as accessing states via a Markov chain. The bandit is allowed to perturb this Markov chain. We prove fundamental theorems for this setting after which we show a close-to-optimal Explore-Commit algorithm.

1.0LGNov 14, 2019

Contextual Bandits Evolving Over Finite Time

Harsh Deshpande, Vishal Jain, Sharayu Moharir

Contextual bandits have the same exploration-exploitation trade-off as standard multi-armed bandits. On adding positive externalities that decay with time, this problem becomes much more difficult as wrong decisions at the start are hard to recover from. We explore existing policies in this setting and highlight their biases towards the inherent reward matrix. We propose a rejection based policy that achieves a low regret irrespective of the structure of the reward probability matrix.