LGAug 3, 2022
Bayesian regularization of empirical MDPsSamarth Gupta, Daniel N. Hill, Lexing Ying et al.
In most applications of model-based Markov decision processes, the parameters for the unknown underlying model are often estimated from the empirical data. Due to noise, the policy learnedfrom the estimated model is often far from the optimal policy of the underlying model. When applied to the environment of the underlying model, the learned policy results in suboptimal performance, thus calling for solutions with better generalization performance. In this work we take a Bayesian perspective and regularize the objective function of the Markov decision process with prior information in order to obtain more robust policies. Two approaches are proposed, one based on $L^1$ regularization and the other on relative entropic regularization. We evaluate our proposed algorithms on synthetic simulations and on real-world search logs of a large scale online shopping store. Our results demonstrate the robustness of regularized MDP policies against the noise present in the models.
OCJun 30, 2023
Uncertainty Informed Optimal Resource Allocation with Gaussian Process based Bayesian InferenceSamarth Gupta, Saurabh Amin
We focus on the problem of uncertainty informed allocation of medical resources (vaccines) to heterogeneous populations for managing epidemic spread. We tackle two related questions: (1) For a compartmental ordinary differential equation (ODE) model of epidemic spread, how can we estimate and integrate parameter uncertainty into resource allocation decisions? (2) How can we computationally handle both nonlinear ODE constraints and parameter uncertainties for a generic stochastic optimization problem for resource allocation? To the best of our knowledge current literature does not fully resolve these questions. Here, we develop a data-driven approach to represent parameter uncertainty accurately and tractably in a novel stochastic optimization problem formulation. We first generate a tractable scenario set by estimating the distribution on ODE model parameters using Bayesian inference with Gaussian processes. Next, we develop a parallelized solution algorithm that accounts for scenario-dependent nonlinear ODE constraints. Our scenario-set generation procedure and solution approach are flexible in that they can handle any compartmental epidemiological ODE model. Our computational experiments on two different non-linear ODE models (SEIR and SEPIHR) indicate that accounting for uncertainty in key epidemiological parameters can improve the efficacy of time-critical allocation decisions by 4-8%. This improvement can be attributed to data-driven and optimal (strategic) nature of vaccine allocations, especially in the early stages of the epidemic when the allocation strategy can crucially impact the long-term trajectory of the disease.
SEDec 26, 2025
State-of-the-art Small Language Coder Model: Mify-CoderAbhinav Parmar, Abhisek Panigrahi, Abhishek Kumar Dwivedi et al.
We present Mify-Coder, a 2.5B-parameter code model trained on 4.2T tokens using a compute-optimal strategy built on the Mify-2.5B foundation model. Mify-Coder achieves comparable accuracy and safety while significantly outperforming much larger baseline models on standard coding and function-calling benchmarks, demonstrating that compact models can match frontier-grade models in code generation and agent-driven workflows. Our training pipeline combines high-quality curated sources with synthetic data generated through agentically designed prompts, refined iteratively using enterprise-grade evaluation datasets. LLM-based quality filtering further enhances data density, enabling frugal yet effective training. Through disciplined exploration of CPT-SFT objectives, data mixtures, and sampling dynamics, we deliver frontier-grade code intelligence within a single continuous training trajectory. Empirical evidence shows that principled data and compute discipline allow smaller models to achieve competitive accuracy, efficiency, and safety compliance. Quantized variants of Mify-Coder enable deployment on standard desktop environments without requiring specialized hardware.
LGAug 20, 2025
Disentanglement in T-space for Faster and Distributed Training of Diffusion Models with Fewer Latent-statesSamarth Gupta, Raghudeep Gadde, Rui Chen et al.
We challenge a fundamental assumption of diffusion models, namely, that a large number of latent-states or time-steps is required for training so that the reverse generative process is close to a Gaussian. We first show that with careful selection of a noise schedule, diffusion models trained over a small number of latent states (i.e. $T \sim 32$) match the performance of models trained over a much large number of latent states ($T \sim 1,000$). Second, we push this limit (on the minimum number of latent states required) to a single latent-state, which we refer to as complete disentanglement in T-space. We show that high quality samples can be easily generated by the disentangled model obtained by combining several independently trained single latent-state models. We provide extensive experiments to show that the proposed disentangled model provides 4-6$\times$ faster convergence measured across a variety of metrics on two different datasets.
LGOct 5, 2021
Approximate Newton policy gradient algorithmsHaoya Li, Samarth Gupta, Hsiangfu Yu et al.
Policy gradient algorithms have been widely applied to Markov decision processes and reinforcement learning problems in recent years. Regularization with various entropy functions is often used to encourage exploration and improve stability. This paper proposes an approximate Newton method for the policy gradient algorithm with entropy regularization. In the case of Shannon entropy, the resulting algorithm reproduces the natural policy gradient algorithm. For other entropy functions, this method results in brand-new policy gradient algorithms. We prove that all these algorithms enjoy Newton-type quadratic convergence and that the corresponding gradient flow converges globally to the optimal solution. We use synthetic and industrial-scale examples to demonstrate that the proposed approximate Newton method typically converges in single-digit iterations, often orders of magnitude faster than other state-of-the-art algorithms.
MLSep 10, 2021
Best-Arm Identification in Correlated Multi-Armed BanditsSamarth Gupta, Gauri Joshi, Osman Yağan
In this paper we consider the problem of best-arm identification in multi-armed bandits in the fixed confidence setting, where the goal is to identify, with probability $1-δ$ for some $δ>0$, the arm with the highest mean reward in minimum possible samples from the set of arms $\mathcal{K}$. Most existing best-arm identification algorithms and analyses operate under the assumption that the rewards corresponding to different arms are independent of each other. We propose a novel correlated bandit framework that captures domain knowledge about correlation between arms in the form of upper bounds on expected conditional reward of an arm, given a reward realization from another arm. Our proposed algorithm C-LUCB, which generalizes the LUCB algorithm utilizes this partial knowledge of correlations to sharply reduce the sample complexity of best-arm identification. More interestingly, we show that the total samples obtained by C-LUCB are of the form $\mathcal{O}\left(\sum_{k \in \mathcal{C}} \log\left(\frac{1}δ\right)\right)$ as opposed to the typical $\mathcal{O}\left(\sum_{k \in \mathcal{K}} \log\left(\frac{1}δ\right)\right)$ samples required in the independent reward setting. The improvement comes, as the $\mathcal{O}(\log(1/δ))$ term is summed only for the set of competitive arms $\mathcal{C}$, which is a subset of the original set of arms $\mathcal{K}$. The size of the set $\mathcal{C}$, depending on the problem setting, can be as small as $2$, and hence using C-LUCB in the correlated bandits setting can lead to significant performance improvements. Our theoretical findings are supported by experiments on the Movielens and Goodreads recommendation datasets.
LGDec 14, 2020
Bandit-based Communication-Efficient Client Selection Strategies for Federated LearningYae Jee Cho, Samarth Gupta, Gauri Joshi et al.
Due to communication constraints and intermittent client availability in federated learning, only a subset of clients can participate in each training round. While most prior works assume uniform and unbiased client selection, recent work on biased client selection has shown that selecting clients with higher local losses can improve error convergence speed. However, previously proposed biased selection strategies either require additional communication cost for evaluating the exact local loss or utilize stale local loss, which can even make the model diverge. In this paper, we present a bandit-based communication-efficient client selection strategy UCB-CS that achieves faster convergence with lower communication overhead. We also demonstrate how client selection can be used to improve fairness.
LGOct 30, 2020
Integer Programming-based Error-Correcting Output Code Design for Robust ClassificationSamarth Gupta, Saurabh Amin
Error-Correcting Output Codes (ECOCs) offer a principled approach for combining simple binary classifiers into multiclass classifiers. In this paper, we investigate the problem of designing optimal ECOCs to achieve both nominal and adversarial accuracy using Support Vector Machines (SVMs) and binary deep learning models. In contrast to previous literature, we present an Integer Programming (IP) formulation to design minimal codebooks with desirable error correcting properties. Our work leverages the advances in IP solvers to generate codebooks with optimality guarantees. To achieve tractability, we exploit the underlying graph-theoretic structure of the constraint set in our IP formulation. This enables us to use edge clique covers to substantially reduce the constraint set. Our codebooks achieve a high nominal accuracy relative to standard codebooks (e.g., one-vs-all, one-vs-one, and dense/sparse codes). We also estimate the adversarial accuracy of our ECOC-based classifiers in a white-box setting. Our IP-generated codebooks provide non-trivial robustness to adversarial perturbations even without any adversarial training.
MLNov 6, 2019
Multi-Armed Bandits with Correlated ArmsSamarth Gupta, Shreyas Chaudhari, Gauri Joshi et al.
We consider a multi-armed bandit framework where the rewards obtained by pulling different arms are correlated. We develop a unified approach to leverage these reward correlations and present fundamental generalizations of classic bandit algorithms to the correlated setting. We present a unified proof technique to analyze the proposed algorithms. Rigorous analysis of C-UCB (the correlated bandit version of Upper-confidence-bound) reveals that the algorithm ends up pulling certain sub-optimal arms, termed as non-competitive, only O(1) times, as opposed to the O(log T) pulls required by classic bandit algorithms such as UCB, TS etc. We present regret-lower bound and show that when arms are correlated through a latent random source, our algorithms obtain order-optimal regret. We validate the proposed algorithms via experiments on the MovieLens and Goodreads datasets, and show significant improvement over classical bandit algorithms.
MLOct 18, 2018
A Unified Approach to Translate Classical Bandit Algorithms to the Structured Bandit SettingSamarth Gupta, Shreyas Chaudhari, Subhojyoti Mukherjee et al.
We consider a finite-armed structured bandit problem in which mean rewards of different arms are known functions of a common hidden parameter $θ^*$. Since we do not place any restrictions of these functions, the problem setting subsumes several previously studied frameworks that assume linear or invertible reward functions. We propose a novel approach to gradually estimate the hidden $θ^*$ and use the estimate together with the mean reward functions to substantially reduce exploration of sub-optimal arms. This approach enables us to fundamentally generalize any classic bandit algorithm including UCB and Thompson Sampling to the structured bandit setting. We prove via regret analysis that our proposed UCB-C algorithm (structured bandit versions of UCB) pulls only a subset of the sub-optimal arms $O(\log T)$ times while the other sub-optimal arms (referred to as non-competitive arms) are pulled $O(1)$ times. As a result, in cases where all sub-optimal arms are non-competitive, which can happen in many practical scenarios, the proposed algorithms achieve bounded regret. We also conduct simulations on the Movielens recommendations dataset to demonstrate the improvement of the proposed algorithms over existing structured bandit algorithms.
MLAug 17, 2018
Correlated Multi-armed Bandits with a Latent Random SourceSamarth Gupta, Gauri Joshi, Osman Yağan
We consider a novel multi-armed bandit framework where the rewards obtained by pulling the arms are functions of a common latent random variable. The correlation between arms due to the common random source can be used to design a generalized upper-confidence-bound (UCB) algorithm that identifies certain arms as $non-competitive$, and avoids exploring them. As a result, we reduce a $K$-armed bandit problem to a $C+1$-armed problem, where $C+1$ includes the best arm and $C$ $competitive$ arms. Our regret analysis shows that the competitive arms need to be pulled $\mathcal{O}(\log T)$ times, while the non-competitive arms are pulled only $\mathcal{O}(1)$ times. As a result, there are regimes where our algorithm achieves a $\mathcal{O}(1)$ regret as opposed to the typical logarithmic regret scaling of multi-armed bandit algorithms. We also evaluate lower bounds on the expected regret and prove that our correlated-UCB algorithm achieves $\mathcal{O}(1)$ regret whenever possible.
LGAug 16, 2018
Active Distribution Learning from Indirect SamplesSamarth Gupta, Gauri Joshi, Osman Yağan
This paper studies the problem of {\em learning} the probability distribution $P_X$ of a discrete random variable $X$ using indirect and sequential samples. At each time step, we choose one of the possible $K$ functions, $g_1, \ldots, g_K$ and observe the corresponding sample $g_i(X)$. The goal is to estimate the probability distribution of $X$ by using a minimum number of such sequential samples. This problem has several real-world applications including inference under non-precise information and privacy-preserving statistical estimation. We establish necessary and sufficient conditions on the functions $g_1, \ldots, g_K$ under which asymptotically consistent estimation is possible. We also derive lower bounds on the estimation error as a function of total samples and show that it is order-wise achievable. Leveraging these results, we propose an iterative algorithm that i) chooses the function to observe at each step based on past observations; and ii) combines the obtained samples to estimate $p_X$. The performance of this algorithm is investigated numerically under various scenarios, and shown to outperform baseline approaches.