Sebastian Perez-Salazar

LG
3 papers
2 citations
Novelty: 55%
AI Score: 39

3 Papers

45.6 · GT · May 15
Online Contract Selection for Continual Coverage

Qinge Chi, Sebastian Perez-Salazar

Motivated by applications where a system must remain operational via continual procurement of contracts, we study two online contract selection problems under uncertain prices. At each time step, a price drawn from a known distribution is revealed online, and the decision-maker may initiate a contract of arbitrary duration, incurring a cost equal to the product of the price and the contract length; moreover, every time period must be covered by at least one active contract. We consider two models depending on how contracts cover time: a \emph{deferred model}, in which contracts are queued back-to-back, and a \emph{concurrent model}, in which contracts become active immediately and may overlap. In both settings, we seek online algorithms that minimize the competitive ratio, i.e., the ratio between the expected cost incurred by the online algorithm and the expected offline optimal cost when all prices are known in advance. We first focus on the case where prices are independent and identically distributed (i.i.d.). For the deferred model, we characterize exactly the worst-case optimal competitive ratio, which is asymptotically $ζ^* \approx 2.472$ as the time horizon grows. For the concurrent model, we prove a lower bound of $ζ^*$ on the optimal competitive ratio and give an algorithm with an asymptotic competitive ratio of at most $4.179$. These bounds improve upon the current lower bound of $2.148$ and upper bound of $6.052$ on the optimal competitive ratio. For both models, our algorithms are quantile-based and can be easily translated into practical threshold-based algorithms for any distribution. Our proofs follow from linear programs and duality arguments in quantile space. Lastly, we show that, in both models, no finite competitive ratio exists when the prices are independent but not necessarily identically distributed, establishing a sharp separation between the two price settings.
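As a rough illustration of how a quantile-based rule can be turned into a price threshold, the sketch below maps a quantile $q$ to the price $F^{-1}(q)$ through the inverse CDF of the price distribution. It is not the algorithm from the paper: the quantile value, the exponential price distribution, and all function names are assumptions chosen only for illustration.

```python
import math


def exponential_inverse_cdf(q, rate=1.0):
    """Inverse CDF of an Exponential(rate) price distribution (assumed example)."""
    if not 0.0 <= q < 1.0:
        raise ValueError("q must lie in [0, 1)")
    return -math.log(1.0 - q) / rate


def price_threshold(q, inverse_cdf):
    """Translate a quantile q into a price threshold F^{-1}(q)."""
    return inverse_cdf(q)


def accepts(price, q, inverse_cdf):
    """Hypothetical rule: act on a price only if it falls at or below the q-quantile."""
    return price <= price_threshold(q, inverse_cdf)


if __name__ == "__main__":
    q = 0.5  # hypothetical quantile, not a value taken from the paper
    print(price_threshold(q, exponential_inverse_cdf))
    print(accepts(0.5, q, exponential_inverse_cdf))
    print(accepts(1.0, q, exponential_inverse_cdf))
```

Because the rule is stated in quantile space, the same value of `q` can be reused with any distribution for which an inverse CDF is available; only `inverse_cdf` changes.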

LG · May 20, 2023
On First-Order Meta-Reinforcement Learning with Moreau Envelopes

Mohammad Taha Toghani, Sebastian Perez-Salazar, César A. Uribe

Meta-Reinforcement Learning (MRL) is a promising framework for training agents that can quickly adapt to new environments and tasks. In this work, we study the MRL problem under the policy gradient formulation and propose a novel algorithm that uses Moreau envelope surrogate regularizers to jointly learn a meta-policy that is adjustable to the environment of each individual task. Our algorithm, called Moreau Envelope Meta-Reinforcement Learning (MEMRL), learns a meta-policy that can adapt to a distribution of tasks by efficiently updating the policy parameters using a combination of gradient-based optimization and Moreau envelope regularization. Moreau envelopes provide a smooth approximation of the policy optimization problem, which enables us to apply standard optimization techniques and converge to an appropriate stationary point. We provide a detailed analysis of the MEMRL algorithm, showing a sublinear convergence rate to a first-order stationary point for non-convex policy gradient optimization. Finally, we demonstrate the effectiveness of MEMRL on a multi-task 2D-navigation problem.
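To make the smoothing property concrete, here is a minimal sketch of the generic Moreau envelope $M_\lambda f(x) = \min_y \{ f(y) + \frac{1}{2\lambda}(x-y)^2 \}$ for the scalar function $f(y) = |y|$, whose envelope is the Huber function. This is a textbook example, not the MEMRL algorithm; the function names and the choice of $f$ are assumptions made for illustration.

```python
def prox_abs(x, lam):
    """Proximal operator of f(y) = |y| with parameter lam (soft-thresholding)."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def moreau_envelope_abs(x, lam):
    """Moreau envelope of |.|: evaluate the objective at the minimizer prox_abs(x, lam)."""
    y = prox_abs(x, lam)
    return abs(y) + (x - y) ** 2 / (2.0 * lam)


def moreau_gradient_abs(x, lam):
    """Gradient of the envelope, (x - prox(x)) / lam; defined everywhere, unlike |x|'."""
    return (x - prox_abs(x, lam)) / lam


if __name__ == "__main__":
    for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
        print(x, moreau_envelope_abs(x, 1.0), moreau_gradient_abs(x, 1.0))
```

The envelope is quadratic near the origin and linear far from it, so its gradient exists at $x = 0$ even though $|x|$ is not differentiable there.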

LG · Oct 24, 2020
Differentially Private Online Submodular Maximization

Sebastian Perez-Salazar, Rachel Cummings

In this work we consider the problem of online submodular maximization under a cardinality constraint with differential privacy (DP). A stream of $T$ submodular functions over a common finite ground set $U$ arrives online, and at each time-step the decision maker must choose at most $k$ elements of $U$ before observing the function. The decision maker obtains a payoff equal to the function evaluated on the chosen set, and aims to learn a sequence of sets that achieves low expected regret. In the full-information setting, we develop an $(\varepsilon,δ)$-DP algorithm with expected $(1-1/e)$-regret bound of $\mathcal{O}\left( \frac{k^2\log |U|\sqrt{T \log k/δ}}{\varepsilon} \right)$. This algorithm contains $k$ ordered experts that learn the best marginal increments for each item over the whole time horizon while maintaining privacy of the functions. In the bandit setting, we provide an $(\varepsilon,δ+ O(e^{-T^{1/3}}))$-DP algorithm with expected $(1-1/e)$-regret bound of $\mathcal{O}\left( \frac{\sqrt{\log k/δ}}{\varepsilon} (k (|U| \log |U|)^{1/3})^2 T^{2/3} \right)$. Our algorithms contain $k$ ordered experts that learn the best marginal item to select given the items chosen by their predecessors, while maintaining privacy of the functions. One challenge for privacy in this setting is that the payoff and feedback of expert $i$ depend on the actions taken by her $i-1$ predecessors. This particular type of information leakage is not covered by post-processing, and new analysis is required. Our techniques for maintaining privacy with feedforward may be of independent interest.
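For intuition about the $(1-1/e)$ benchmark and the idea of choosing each item by its marginal gain given earlier choices, the sketch below runs the classical offline, non-private greedy algorithm on a small coverage function. It is only an illustration under assumed inputs: it has no online learning and no privacy mechanism, and the item names and sets are hypothetical.

```python
def coverage(sets, chosen):
    """Coverage function: number of distinct elements covered by the chosen items."""
    covered = set()
    for item in chosen:
        covered |= sets[item]
    return len(covered)


def greedy_max(sets, k):
    """Pick at most k items, each time adding the item with the largest marginal gain."""
    chosen = []
    for _ in range(k):
        best_item, best_gain = None, 0
        base = coverage(sets, chosen)
        for item in sets:
            if item in chosen:
                continue
            gain = coverage(sets, chosen + [item]) - base
            if gain > best_gain:
                best_item, best_gain = item, gain
        if best_item is None:  # no remaining item adds value
            break
        chosen.append(best_item)
    return chosen


if __name__ == "__main__":
    # Hypothetical ground set: each item covers a few elements.
    example_sets = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1}}
    picked = greedy_max(example_sets, k=2)
    print(picked, coverage(example_sets, picked))
```

Each of the `k` greedy rounds plays a role loosely analogous to one of the $k$ ordered experts in the abstract: round $i$ selects an item conditioned on the $i-1$ items selected before it.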