Vivek S. Borkar

LG
h-index50
26papers
265citations
Novelty41%
AI Score43

26 Papers

SYOct 17, 2017
Opportunistic Scheduling as Restless Bandits

Vivek S. Borkar, Gaurav S. Kasbekar, Sarath Pattathil et al.

In this paper we consider energy efficient scheduling in a multiuser setting where each user has a finite sized queue and there is a cost associated with holding packets (jobs) in each queue (modeling the delay constraints). The packets of each user need to be sent over a common channel. The channel qualities seen by the users are time-varying and differ across users; also, the cost incurred, i.e., energy consumed, in packet transmission is a function of the channel quality. We pose the problem as an average cost Markov Decision Problem, and prove that this problem is Whittle Indexable. Based on this result, we propose an algorithm in which the Whittle index of each user is computed and the user who has the lowest value is selected for transmission. We evaluate the performance of this algorithm via simulations and show that it achieves a lower average cost than the Maximum Weight Scheduling and Weighted Fair Scheduling strategies.

PRJan 23, 2013
Asymptotics of the Invariant Measure in Mean Field Models with Jumps

Vivek S. Borkar, Rajesh Sundaresan

We consider the asymptotics of the invariant measure for the process of the empirical spatial distribution of $N$ coupled Markov chains in the limit of a large number of chains. Each chain reflects the stochastic evolution of one particle. The chains are coupled through the dependence of the transition rates on this spatial distribution of particles in the various states. Our model is a caricature for medium access interactions in wireless local area networks. It is also applicable to the study of spread of epidemics in a network. The limiting process satisfies a deterministic ordinary differential equation called the McKean-Vlasov equation. When this differential equation has a unique globally asymptotically stable equilibrium, the spatial distribution asymptotically concentrates on this equilibrium. More generally, its limit points are supported on a subset of the $ω$-limit sets of the McKean-Vlasov equation. Using a control-theoretic approach, we examine the question of large deviations of the invariant measure from this limit.

OCApr 2, 2013
Convergence of The Relative Value Iteration for the Ergodic Control Problem of Nondegenerate Diffusions under Near-Monotone Costs

Ari Arapostathis, Vivek S. Borkar, K. Suresh Kumar

We study the relative value iteration for the ergodic control problem under a near-monotone running cost structure for a nondegenerate diffusion controlled through its drift. This algorithm takes the form of a quasilinear parabolic Cauchy initial value problem in $\RR^{d}$. We show that this Cauchy problem stabilizes, or in other words, that the solution of the quasilinear parabolic equation converges for every bounded initial condition in $\Cc^{2}(\RR^{d})$ to the solution of the Hamilton--Jacobi--Bellman (HJB) equation associated with the ergodic control problem.

LGOct 10, 2022
Actor-Critic or Critic-Actor? A Tale of Two Time Scales

Shalabh Bhatnagar, Vivek S. Borkar, Soumyajit Guin

We revisit the standard formulation of tabular actor-critic algorithm as a two time-scale stochastic approximation with value function computed on a faster time-scale and policy computed on a slower time-scale. This emulates policy iteration. We observe that reversal of the time scales will in fact emulate value iteration and is a legitimate algorithm. We provide a proof of convergence and compare the two empirically with and without function approximation (with both linear and nonlinear function approximators) and observe that our proposed critic-actor algorithm performs on par with actor-critic in terms of both accuracy and computational effort.

OCAug 14, 2023
Average cost optimal control under weak ergodicity hypotheses: Relative value iterations

Ari Arapostathis, Vivek S. Borkar

We study Markov decision processes with Polish state and action spaces. The action space is state dependent and is not necessarily compact. We first establish the existence of an optimal ergodic occupation measure using only a near-monotone hypothesis on the running cost. Then we study the well-posedness of Bellman equation, or what is commonly known as the average cost optimality equation, under the additional hypothesis of the existence of a small set. We deviate from the usual approach which is based on the vanishing discount method and instead map the problem to an equivalent one for a controlled split chain. We employ a stochastic representation of the Poisson equation to derive the Bellman equation. Next, under suitable assumptions, we establish convergence results for the 'relative value iteration' algorithm which computes the solution of the Bellman equation recursively. In addition, we present some results concerning the stability and asymptotic optimality of the associated rolling horizon policies.

OCApr 2, 2013
Relative Value Iteration for Stochastic Differential Games

Ari Arapostathis, Vivek S. Borkar, K. Suresh Kumar

We study zero-sum stochastic differential games with player dynamics governed by a nondegenerate controlled diffusion process. Under the assumption of uniform stability, we establish the existence of a solution to the Isaac's equation for the ergodic game and characterize the optimal stationary strategies. The data is not assumed to be bounded, nor do we assume geometric ergodicity. Thus our results extend previous work in the literature. We also study a relative value iteration scheme that takes the form of a parabolic Isaac's equation. Under the hypothesis of geometric ergodicity we show that the relative value iteration converges to the elliptic Isaac's equation as time goes to infinity. We use these results to establish convergence of the relative value iteration for risk-sensitive control problems under an asymptotic flatness assumption.

PFFeb 9, 2019
Distributed Server Allocation for Content Delivery Networks

Sarath Pattathil, Vivek S. Borkar, Gaurav S. Kasbekar

We propose a dynamic formulation of file-sharing networks in terms of an average cost Markov decision process with constraints. By analyzing a Whittle-like relaxation thereof, we propose an index policy in the spirit of Whittle and compare it by simulations with other natural heuristics.

SYNov 21, 2023
Decentralised Q-Learning for Multi-Agent Markov Decision Processes with a Satisfiability Criterion

Keshav P. Keval, Vivek S. Borkar

In this paper, we propose a reinforcement learning algorithm to solve a multi-agent Markov decision process (MMDP). The goal, inspired by Blackwell's Approachability Theorem, is to lower the time average cost of each agent to below a pre-specified agent-specific bound. For the MMDP, we assume the state dynamics to be controlled by the joint actions of agents, but the per-stage costs to only depend on the individual agent's actions. We combine the Q-learning algorithm for a weighted combination of the costs of each agent, obtained by a gossip algorithm with the Metropolis-Hastings or Multiplicative Weights formalisms to modulate the averaging matrix of the gossip. We use multiple timescales in our algorithm and prove that under mild conditions, it approximately achieves the desired bounds for each of the agents. We also demonstrate the empirical performance of this algorithm in the more general setting of MMDPs having jointly controlled per-stage costs.

LGFeb 18
Regret and Sample Complexity of Online Q-Learning via Concentration of Stochastic Approximation with Time-Inhomogeneous Markov Chains

Rahul Singh, Siddharth Chandak, Eric Moulines et al.

We present the first high-probability regret bound for classical online Q-learning in infinite-horizon discounted Markov decision processes, without relying on optimism or bonus terms. We first analyze Boltzmann Q-learning with decaying temperature and show that its regret depends critically on the suboptimality gap of the MDP: for sufficiently large gaps, the regret is sublinear, while for small gaps it deteriorates and can approach linear growth. To address this limitation, we study a Smoothed $ε_n$-Greedy exploration scheme that combines $ε_n$-greedy and Boltzmann exploration, for which we prove a gap-robust regret bound of near-$\tilde{O}(N^{9/10})$. To analyze these algorithms, we develop a high-probability concentration bound for contractive Markovian stochastic approximation with iterate- and time-dependent transition dynamics. This bound may be of independent interest as the contraction factor in our bound is governed by the mixing time and is allowed to converge to one asymptotically.

SYNov 24, 2023
Approximation of Convex Envelope Using Reinforcement Learning

Vivek S. Borkar, Adit Akarsh

Oberman gave a stochastic control formulation of the problem of estimating the convex envelope of a non-convex function. Based on this, we develop a reinforcement learning scheme to approximate the convex envelope, using a variant of Q-learning for controlled optimal stopping. It shows very promising results on a standard library of test problems.

LGMay 19
Adynamical systems view of training generativemodels and the memorization phenomenon

Siva Athreya, Chiranjib Bhattacharya, Vivek S. Borkar

Using recent works of one of the authors (VSB) on collapse in generative models and two time scale dynamics in stochastic gradient descent in high dimensions, we give a system theoretic explanation of the memorization phenomenon in generative models. This relies purely on the dynamic aspects of the training phase. Specifically, we use a result of Austin [2016] to motivate a stylized model for the loss function for stochastic gradient descent (SGD) wherein the loss function has a strong dependence on some variables and weak dependence on the rest in a precise sense. This naturally leads to two distinct time scales in the constant step size SGD that is commonly used in machine learning. This fact has been used to explain the double descent phenomenon in SGD in Borkar [2026]. In conjunction with a mathematical model for collapse phenomenon in SGD developed in Borkar [2025a], we analyze the constant step size SGD using the recent results of Azizian et al. [2024] in order to explain the phenomenon of memorization wherein a generative model that is concurrently being tuned yields the same or similar outputs for significant stretches of time. This gives a novel perspective on the aforementioned phenomena reported in machine learning literature and their interrelationships, using a dynamical systems viewpoint.

LGDec 16, 2023
A Concentration Bound for TD(0) with Function Approximation

Siddharth Chandak, Vivek S. Borkar

We derive a concentration bound of the type `for all $n \geq n_0$ for some $n_0$' for TD(0) with linear function approximation. We work with online TD learning with samples from a single sample path of the underlying Markov chain. This makes our analysis significantly different from offline TD learning or TD learning with access to independent samples from the stationary distribution of the Markov chain. We treat TD(0) as a contractive stochastic approximation algorithm, with both martingale and Markov noises. Markov noise is handled using the Poisson equation and the lack of almost sure guarantees on boundedness of iterates is handled using the concept of relaxed concentration inequalities.

LGDec 17, 2024
Lagrangian Index Policy for Restless Bandits with Average Reward

Konstantin Avrachenkov, Vivek S. Borkar, Pratik Shah

We study the Lagrange Index Policy (LIP) for restless multi-armed bandits with long-run average reward. In particular, we compare the performance of LIP with the performance of the Whittle Index Policy (WIP), both heuristic policies known to be asymptotically optimal under certain natural conditions. Even though in most cases their performances are very similar, in the cases when WIP shows bad performance, LIP continues to perform very well. We then propose reinforcement learning algorithms, both tabular and NN-based, to obtain online learning schemes for LIP in the model-free setting. The proposed reinforcement learning schemes for LIP require significantly less memory than the analogous schemes for WIP. We calculate analytically the Lagrange index for the restart model, which applies to the optimal web crawling and the minimization of the weighted age of information. We also give a new proof of asymptotic optimality in case of homogeneous arms as the number of arms goes to infinity, based on exchangeability and de Finetti's theorem.

LGFeb 17, 2025
An Actor-Critic Algorithm with Function Approximation for Risk Sensitive Cost Markov Decision Processes

Soumyajit Guin, Vivek S. Borkar, Shalabh Bhatnagar

In this paper, we consider the risk-sensitive cost criterion with exponentiated costs for Markov decision processes and develop a model-free policy gradient algorithm in this setting. Unlike additive cost criteria such as average or discounted cost, the risk-sensitive cost criterion is less studied due to the complexity resulting from the multiplicative structure of the resulting Bellman equation. We develop an actor-critic algorithm with function approximation in this setting and provide its asymptotic convergence analysis. We also show the results of numerical experiments that demonstrate the superiority in performance of our algorithm over other recent algorithms in the literature.

LGNov 4, 2021
A Concentration Bound for LSPE($λ$)

Siddharth Chandak, Vivek S. Borkar, Harsh Dolhare

The popular LSPE($λ$) algorithm for policy evaluation is revisited to derive a concentration bound that gives high probability performance guarantees from some time on.

LGJun 27, 2021
Concentration of Contractive Stochastic Approximation and Reinforcement Learning

Siddharth Chandak, Vivek S. Borkar, Parth Dodhia

Using a martingale concentration inequality, concentration bounds `from time $n_0$ on' are derived for stochastic approximation algorithms with contractive maps and both martingale difference and Markov noises. These are applied to reinforcement learning algorithms, in particular to asynchronous Q-learning and TD(0).

OCJul 8, 2020
Dynamic social learning under graph constraints

Konstantin Avrachenkov, Vivek S. Borkar, Sharayu Moharir et al.

We introduce a model of graph-constrained dynamic choice with reinforcement modeled by positively $α$-homogeneous rewards. We show that its empirical process, which can be written as a stochastic approximation recursion with Markov noise, has the same probability law as a certain vertex reinforced random walk. We use this equivalence to show that for $α> 0$, the asymptotic outcome concentrates around the optimum in a certain limiting sense when `annealed' by letting $α\uparrow\infty$ slowly.

LGApr 29, 2020
Whittle index based Q-learning for restless bandits with average reward

Konstantin E. Avrachenkov, Vivek S. Borkar

A novel reinforcement learning algorithm is introduced for multiarmed restless bandits with average reward, using the paradigms of Q-learning and Whittle index. Specifically, we leverage the structure of the Whittle index policy to reduce the search space of Q-learning, resulting in major computational gains. Rigorous convergence analysis is provided, supported by numerical experiments. The numerical experiments show excellent empirical performance of the proposed scheme.

ROSep 11, 2017
Vector Field Guidance for Convoy Monitoring Using Elliptical Orbits

Aseem V. Borkar, Vivek S. Borkar, Arpita Sinha

We propose a novel vector field based guidance scheme for tracking and surveillance of a convoy, moving along a possibly nonlinear trajectory on the ground, by an aerial agent. The scheme first computes a time varying ellipse that encompasses all the targets in the convoy using a simple regression based algorithm. It then ensures convergence of the agent to a trajectory that repeatedly traverses this moving ellipse. The scheme is analyzed using perturbation theory of nonlinear differential equations and supporting simulations are provided. Some related implementation issues are discussed and advantages of the scheme are highlighted.

SYAug 28, 2017
Distributed Stochastic Approximation with Local Projections

Suhail M. Shah, Vivek S. Borkar

We propose a distributed version of a stochastic approximation scheme constrained to remain in the intersection of a finite family of convex sets. The projection to the intersection of these sets is also computed in a distributed manner and a `nonlinear gossip' mechanism is employed to blend the projection iterations with the stochastic approximation using multiple time scales

SYJul 13, 2017
Whittle Indexability in Egalitarian Processor Sharing Systems

Vivek S. Borkar, Sarath Pattathil

The egalitarian processor sharing model is viewed as a restless bandit and its Whittle indexability is established. A numerical scheme for computing the Whittle indices is provided, along with supporting numerical experiments.

LGMay 9, 2016
Randomized Kaczmarz for Rank Aggregation from Pairwise Comparisons

Vivek S. Borkar, Nikhil Karamchandani, Sharad Mirani

We revisit the problem of inferring the overall ranking among entities in the framework of Bradley-Terry-Luce (BTL) model, based on available empirical data on pairwise preferences. By a simple transformation, we can cast the problem as that of solving a noisy linear system, for which a ready algorithm is available in the form of the randomized Kaczmarz method. This scheme is provably convergent, has excellent empirical performance, and is amenable to on-line, distributed and asynchronous variants. Convergence, convergence rate, and error analysis of the proposed algorithm are presented and several numerical experiments are conducted whose results validate our theoretical findings.

MLNov 27, 2015
Gradient Estimation with Simultaneous Perturbation and Compressive Sensing

Vivek S. Borkar, Vikranth R. Dwaracherla, Neeraja Sahasrabudhe

This paper aims at achieving a "good" estimator for the gradient of a function on a high-dimensional space. Often such functions are not sensitive in all coordinates and the gradient of the function is almost sparse. We propose a method for gradient estimation that combines ideas from Spall's Simultaneous Perturbation Stochastic Approximation with compressive sensing. The aim is to obtain "good" estimator without too many function evaluations. Application to estimating gradient outer product matrix as well as standard optimization problems are illustrated via simulations.

OCNov 30, 2014
Empirical Q-Value Iteration

Dileep Kalathil, Vivek S. Borkar, Rahul Jain

We propose a new simple and natural algorithm for learning the optimal Q-value function of a discounted-cost Markov Decision Process (MDP) when the transition kernels are unknown. Unlike the classical learning algorithms for MDPs, such as Q-learning and actor-critic algorithms, this algorithm doesn't depend on a stochastic approximation-based method. We show that our algorithm, which we call the empirical Q-value iteration (EQVI) algorithm, converges to the optimal Q-value function. We also give a rate of convergence or a non-asymptotic sample complexity bound, and also show that an asynchronous (or online) version of the algorithm will also work. Preliminary experimental results suggest a faster rate of convergence to a ball park estimate for our algorithm compared to stochastic approximation-based algorithms.

LGNov 1, 2013
Reinforcement Learning for Matrix Computations: PageRank as an Example

Vivek S. Borkar, Adwaitvedant S. Mathkar

Reinforcement learning has gained wide popularity as a technique for simulation-driven approximate dynamic programming. A less known aspect is that the very reasons that make it effective in dynamic programming can also be leveraged for using it for distributed schemes for certain matrix computations involving non-negative matrices. In this spirit, we propose a reinforcement learning algorithm for PageRank computation that is fashioned after analogous schemes for approximate dynamic programming. The algorithm has the advantage of ease of distributed implementation and more importantly, of being model-free, i.e., not dependent on any specific assumptions about the transition probabilities in the random web-surfer model. We analyze its convergence and finite time behavior and present some supporting numerical experiments.

DCOct 28, 2013
Distributed Reinforcement Learning via Gossip

Adwaitvedant S. Mathkar, Vivek S. Borkar

We consider the classical TD(0) algorithm implemented on a network of agents wherein the agents also incorporate the updates received from neighboring agents using a gossip-like mechanism. The combined scheme is shown to converge for both discounted and average cost problems.