LGMar 23, 2023
Stochastic Submodular Bandits with Delayed Composite Anonymous Bandit FeedbackMohammad Pedramfar, Vaneet Aggarwal
This paper investigates the problem of combinatorial multiarmed bandits with stochastic submodular (in expectation) rewards and full-bandit delayed feedback, where the delayed feedback is assumed to be composite and anonymous. In other words, the delayed feedback is composed of components of rewards from past actions, with unknown division among the sub-components. Three models of delayed feedback: bounded adversarial, stochastic independent, and stochastic conditionally independent are studied, and regret bounds are derived for each of the delay models. Ignoring the problem dependent parameters, we show that regret bound for all the delay models is $\tilde{O}(T^{2/3} + T^{1/3} ν)$ for time horizon $T$, where $ν$ is a delay parameter defined differently in the three cases, thus demonstrating an additive term in regret with delay in all the three delay models. The considered algorithm is demonstrated to outperform other full-bandit approaches with delayed composite anonymous feedback.
51.5LGApr 28
The Role of Symmetry in Optimizing Overparameterized NetworksKusha Sareen, Mohammad Pedramfar, Sékou-Oumar Kaba et al.
Overparameterization is central to the success of deep learning, yet the mechanisms by which it improves optimization remain incompletely understood. We analyze weight-space symmetries in neural networks and show that overparameterization introduces additional symmetries that benefit optimization in two distinct ways. First, we prove that these symmetries act as a form of diagonal preconditioning on the Hessian, enabling the existence of better-conditioned minima within each equivalence class of functionally identical solutions. Second, we show that overparameterization increases the probability mass of global minima near typical initializations, making these favorable solutions more reachable. Teacher-student network experiments validate our theoretical predictions: as width increases, the Hessian trace decreases, condition numbers improve, and convergence accelerates. Our analysis provides a unified framework for understanding overparameterization and width growth as a geometric transformation of the loss landscape.
MLOct 30, 2023
Improved Bayesian Regret Bounds for Thompson Sampling in Reinforcement LearningAhmadreza Moradipari, Mohammad Pedramfar, Modjtaba Shokrian Zini et al.
In this paper, we prove the first Bayesian regret bounds for Thompson Sampling in reinforcement learning in a multitude of settings. We simplify the learning problem using a discrete set of surrogate environments, and present a refined analysis of the information ratio using posterior consistency. This leads to an upper bound of order $\widetilde{O}(H\sqrt{d_{l_1}T})$ in the time inhomogeneous reinforcement learning problem where $H$ is the episode length and $d_{l_1}$ is the Kolmogorov $l_1-$dimension of the space of environments. We then find concrete bounds of $d_{l_1}$ in a variety of settings, such as tabular, linear and finite mixtures, and discuss how how our results are either the first of their kind or improve the state-of-the-art.
LGFeb 13
$γ$-weakly $θ$-up-concavity: Linearizable Non-Convex Optimization with Applications to DR-Submodular and OSS FunctionsMohammad Pedramfar, Vaneet Aggarwal
Optimizing monotone non-convex functions is a fundamental challenge across machine learning and combinatorial optimization. We introduce and study $γ$-weakly $θ$-up-concavity, a novel first-order condition that characterizes a broad class of such functions. This condition provides a powerful unifying framework, strictly generalizing both DR-submodular functions and One-Sided Smooth (OSS) functions. Our central theoretical contribution demonstrates that $γ$-weakly $θ$-up-concave functions are upper-linearizable: for any feasible point, we can construct a linear surrogate whose gains provably approximate the original non-linear objective. This approximation holds up to a constant factor, namely the approximation coefficient, dependent solely on $γ$, $θ$, and the geometry of the feasible set. This linearizability yields immediate and unified approximation guarantees for a wide range of problems. Specifically, we obtain unified approximation guarantees for offline optimization as well as static and dynamic regret bounds in online settings via standard reductions to linear optimization. Moreover, our framework recovers the optimal approximation coefficient for DR-submodular maximization and significantly improves existing approximation coefficients for OSS optimization, particularly over matroid constraints.
LGFeb 24
Upper-Linearizability of Online Non-Monotone DR-Submodular Maximization over Down-Closed Convex SetsYiyang Lu, Haresh Jadav, Mohammad Pedramfar et al.
We study online maximization of non-monotone Diminishing-Return(DR)-submodular functions over down-closed convex sets, a regime where existing projection-free online methods suffer from suboptimal regret and limited feedback guarantees. Our main contribution is a new structural result showing that this class is $1/e$-linearizable under carefully designed exponential reparametrization, scaling parameter, and surrogate potential, enabling a reduction to online linear optimization. As a result, we obtain $O(T^{1/2})$ static regret with a single gradient query per round and unlock adaptive and dynamic regret guarantees, together with improved rates under semi-bandit, bandit, and zeroth-order feedback. Across all feedback models, our bounds strictly improve the state of the art.
LGJun 25, 2025
Diffusion Tree Sampling: Scalable inference-time alignment of diffusion modelsVineet Jain, Kusha Sareen, Mohammad Pedramfar et al.
Adapting a pretrained diffusion model to new objectives at inference time remains an open problem in generative modeling. Existing steering methods suffer from inaccurate value estimation, especially at high noise levels, which biases guidance. Moreover, information from past runs is not reused to improve sample quality, resulting in inefficient use of compute. Inspired by the success of Monte Carlo Tree Search, we address these limitations by casting inference-time alignment as a search problem that reuses past computations. We introduce a tree-based approach that samples from the reward-aligned target density by propagating terminal rewards back through the diffusion chain and iteratively refining value estimates with each additional generation. Our proposed method, Diffusion Tree Sampling (DTS), produces asymptotically exact samples from the target distribution in the limit of infinite rollouts, and its greedy variant, Diffusion Tree Search (DTS$^\star$), performs a global search for high reward samples. On MNIST and CIFAR-10 class-conditional generation, DTS matches the FID of the best-performing baseline with up to $10\times$ less compute. In text-to-image generation and language completion tasks, DTS$^\star$ effectively searches for high reward samples that match best-of-N with up to $5\times$ less compute. By reusing information from previous generations, we get an anytime algorithm that turns additional compute into steadily better samples, providing a scalable approach for inference-time alignment of diffusion models.
OCApr 27, 2024
From Linear to Linearizable Optimization: A Novel Framework with Applications to Stationary and Non-stationary DR-submodular OptimizationMohammad Pedramfar, Vaneet Aggarwal
This paper introduces the notion of upper-linearizable/quadratizable functions, a class that extends concavity and DR-submodularity in various settings, including monotone and non-monotone cases over different convex sets. A general meta-algorithm is devised to convert algorithms for linear/quadratic maximization into ones that optimize upper-linearizable/quadratizable functions, offering a unified approach to tackling concave and DR-submodular optimization problems. The paper extends these results to multiple feedback settings, facilitating conversions between semi-bandit/first-order feedback and bandit/zeroth-order feedback, as well as between first/zeroth-order feedback and semi-bandit/bandit feedback. Leveraging this framework, new algorithms are derived using existing results as base algorithms for convex optimization, improving upon state-of-the-art results in various cases. Dynamic and adaptive regret guarantees are obtained for DR-submodular maximization, marking the first algorithms to achieve such guarantees in these settings. Notably, the paper achieves these advancements with fewer assumptions compared to existing state-of-the-art results, underscoring its broad applicability and theoretical contributions to non-convex optimization.
LGFeb 23, 2025
Order-Optimal Projection-Free Algorithm for Adversarially Constrained Online Convex OptimizationYiyang Lu, Mohammad Pedramfar, Vaneet Aggarwal
Projection-based algorithms for constrained Online Convex Optimization (COCO) face scalability challenges in high-dimensional settings due to the computational complexity of projecting iterates onto constraint sets. This paper introduces a projection-free algorithm for COCO that achieves state-of-the-art performance guarantees while eliminating the need for projections. By integrating a separation oracle with adaptive Online Gradient Descent (OGD) and employing a Lyapunov-driven surrogate function, while dynamically adjusting step sizes using gradient norms, our method jointly optimizes the regret and cumulative constraint violation (CCV). We also use a blocked version of OGD that helps achieve tradeoffs betweeen the regret and CCV with the number of calls to the separation oracle. For convex cost functions, our algorithm attains an optimal regret of $\mathcal{O}(\sqrt{T})$ and a CCV of $\mathcal{O}(\sqrt{T} \log T)$, matching the best-known projection-based results, while only using $\tilde{\mathcal{O}}({T})$ calls to the separation oracle. The results also demonstrate a tradeoff where lower calls to the separation oracle increase the regret and the CCV. In the strongly convex setting, we further achieve a regret of $\mathcal{O}(\log T)$ and a CCV of $\mathcal{O}(\sqrt{T\log T} )$, while requiring ${\mathcal{O}}({T}^2)$ calls to the separation oracle. Further, tradeoff with the decreasing oracle calls is studied. These results close the gap between projection-free and projection-based approaches, demonstrating that projection-free methods can achieve performance comparable to projection-based counterparts.
LGMar 15, 2024
Unified Projection-Free Algorithms for Adversarial DR-Submodular OptimizationMohammad Pedramfar, Yididiya Y. Nadew, Christopher J. Quinn et al.
This paper introduces unified projection-free Frank-Wolfe type algorithms for adversarial continuous DR-submodular optimization, spanning scenarios such as full information and (semi-)bandit feedback, monotone and non-monotone functions, different constraints, and types of stochastic queries. For every problem considered in the non-monotone setting, the proposed algorithms are either the first with proven sub-linear $α$-regret bounds or have better $α$-regret bounds than the state of the art, where $α$ is a corresponding approximation bound in the offline setting. In the monotone setting, the proposed approach gives state-of-the-art sub-linear $α$-regret bounds among projection-free algorithms in 7 of the 8 considered cases while matching the result of the remaining case. Additionally, this paper addresses semi-bandit and bandit feedback for adversarial DR-submodular optimization, advancing the understanding of this optimization area.
LGFeb 13, 2024
A Unified Framework for Analyzing Meta-algorithms in Online Convex OptimizationMohammad Pedramfar, Vaneet Aggarwal
In this paper, we analyze the problem of online convex optimization in different settings, including different feedback types (full-information/semi-bandit/bandit/etc) in either stochastic or non-stochastic setting and different notions of regret (static adversarial regret/dynamic regret/adaptive regret). This is done through a framework which allows us to systematically propose and analyze meta-algorithms for the various settings described above. We show that any algorithm for online linear optimization with fully adaptive adversaries is an algorithm for online convex optimization. We also show that any such algorithm that requires full-information feedback may be transformed to an algorithm with semi-bandit feedback with comparable regret bound. We further show that algorithms that are designed for fully adaptive adversaries using deterministic semi-bandit feedback can obtain similar bounds using only stochastic semi-bandit feedback when facing oblivious adversaries. We use this to describe general meta-algorithms to convert first order algorithms to zeroth order algorithms with comparable regret bounds. Our framework allows us to analyze online optimization in various settings, recovers several results in the literature with a simplified proof technique, and provides new results.
LGJul 14, 2025
Multi-Armed Sampling Problem and the End of ExplorationMohammad Pedramfar, Siamak Ravanbakhsh
This paper introduces the framework of multi-armed sampling, as the sampling counterpart to the optimization problem of multi-arm bandits. Our primary motivation is to rigorously examine the exploration-exploitation trade-off in the context of sampling. We systematically define plausible notions of regret for this framework and establish corresponding lower bounds. We then propose a simple algorithm that achieves these optimal regret bounds. Our theoretical results demonstrate that in contrast to optimization, sampling does not require exploration. To further connect our findings with those of multi-armed bandits, we define a continuous family of problems and associated regret measures that smoothly interpolates and unifies multi-armed sampling and multi-armed bandit problems using a temperature parameter. We believe the multi-armed sampling framework, and our findings in this setting can have a foundational role in the study of sampling including recent neural samplers, akin to the role of multi-armed bandits in reinforcement learning. In particular, our work sheds light on the need for exploration and the convergence properties of algorithm for entropy-regularized reinforcement learning, fine-tuning of pretrained models and reinforcement learning with human feedback (RLHF).
OCJan 30, 2025
Decentralized Projection-free Online Upper-Linearizable Optimization with Applications to DR-Submodular OptimizationYiyang Lu, Mohammad Pedramfar, Vaneet Aggarwal
We introduce a novel framework for decentralized projection-free optimization, extending projection-free methods to a broader class of upper-linearizable functions. Our approach leverages decentralized optimization techniques with the flexibility of upper-linearizable function frameworks, effectively generalizing traditional DR-submodular function optimization. We obtain the regret of $O(T^{1-θ/2})$ with communication complexity of $O(T^θ)$ and number of linear optimization oracle calls of $O(T^{2θ})$ for decentralized upper-linearizable function optimization, for any $0\le θ\le 1$. This approach allows for the first results for monotone up-concave optimization with general convex constraints and non-monotone up-concave optimization with general convex constraints. Further, the above results for first order feedback are extended to zeroth order, semi-bandit, and bandit feedback.
LGMay 26, 2023
A Unified Approach for Maximizing Continuous DR-submodular FunctionsMohammad Pedramfar, Christopher John Quinn, Vaneet Aggarwal
This paper presents a unified approach for maximizing continuous DR-submodular functions that encompasses a range of settings and oracle access types. Our approach includes a Frank-Wolfe type offline algorithm for both monotone and non-monotone functions, with different restrictions on the general convex set. We consider settings where the oracle provides access to either the gradient of the function or only the function value, and where the oracle access is either deterministic or stochastic. We determine the number of required oracle accesses in all cases. Our approach gives new/improved results for nine out of the sixteen considered cases, avoids computationally expensive projections in two cases, with the proposed framework matching performance of state-of-the-art approaches in the remaining five cases. Notably, our approach for the stochastic function value-based oracle enables the first regret bounds with bandit feedback for stochastic DR-submodular functions.
LGJan 28, 2020
Coagent Networks RevisitedModjtaba Shokrian Zini, Mohammad Pedramfar, Matthew Riemer et al.
Coagent networks formalize the concept of arbitrary networks of stochastic agents that collaborate to take actions in a reinforcement learning environment. Prominent examples of coagent networks in action include approaches to hierarchical reinforcement learning (HRL), such as those using options, which attempt to address the exploration exploitation trade-off by introducing abstract actions at different levels by sequencing multiple stochastic networks within the HRL agents. We first provide a unifying perspective on the many diverse examples that fall under coagent networks. We do so by formalizing the rules of execution in a coagent network, enabled by the novel and intuitive idea of execution paths in a coagent network. Motivated by parameter sharing in the hierarchical option-critic architecture, we revisit the coagent network theory and achieve a much shorter proof of the policy gradient theorem using our idea of execution paths, without any assumption on how parameters are shared among coagents. We then generalize our setting and proof to include the scenario where coagents act asynchronously. This new perspective and theorem also lead to more mathematically accurate and performant algorithms than those in the existing literature. Lastly, by running nonstationary RL experiments, we survey the performance and properties of different generalizations of option-critic models.