Merwan Barlier

h-index: 3

8 papers

95 citations

Novelty: 54%

AI Score: 32

Ranked #127,953 of 194,257 authors (top 66%); #28,159 in LG (top 70%)

8 Papers

15.9 · AI · Jul 10

PromptPack: Scaling LLM Annotation Agents for Online Recommendation

Sebastian Koralewski, Merwan Barlier, Yulia Stolin et al.

Online recommendation platforms increasingly use Large Language Models (LLMs) to extract structured features from ad creatives. While deploying a single-call LLM annotation agent yields significant Click-Through Rate (CTR) improvements in our live production environment, per-creative prompting is prohibitively expensive to scale. The redundant system instructions sent in every request account for 94% of billed input tokens. To break this cost bottleneck, we introduce PromptPack, a scalable, high-throughput LLM annotation agent. PromptPack achieves this scale via in-context batching, combining a shared system prompt, a strict XML structural envelope, and an output correction layer to ensure deterministic, pipeline-ready feature extraction across multiple creatives simultaneously. We evaluate PromptPack via an offline retrieval benchmark using a downstream logistic-regression ranker. To deeply profile the agent's behavior, we measure AUC and introduce Volume-Weighted Absolute Lift (VWAL), a novel metric capturing the signal quality of the generated features. Compared to our live, unbatched production baseline, PromptPack at batch size 20 cuts our LLM costs by 89% and accelerates throughput by 2.5x while fully preserving AUC.
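As a rough sanity check on the figures in this abstract, the effect of sharing one system prompt across a batch can be worked out with simple arithmetic. The sketch below is not the authors' cost model: it assumes cost is proportional to input tokens only, ignores output tokens and the overhead of the XML envelope, and uses hypothetical function names.

```python
def input_tokens_per_creative(system_share: float, batch_size: int) -> float:
    """Normalized input tokens billed per creative.

    system_share: fraction of an unbatched request's input tokens taken
        up by the shared system prompt (0.94 in the abstract).
    batch_size: number of creatives packed into one request.

    Assumption: the system prompt is sent once per request, so its cost
    is split evenly across the batch; creative-specific tokens are not.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    creative_share = 1.0 - system_share
    return system_share / batch_size + creative_share


def input_token_saving(system_share: float, batch_size: int) -> float:
    """Fractional reduction in input tokens relative to batch size 1."""
    return 1.0 - input_tokens_per_creative(system_share, batch_size)


if __name__ == "__main__":
    # Figures taken from the abstract: 94% system-prompt share, batch size 20.
    print(f"{input_token_saving(0.94, 20):.3f}")  # prints 0.893
```

Under these simplifying assumptions, a batch size of 20 removes roughly 89% of input tokens, which is in line with the reported 89% cost reduction; the actual accounting in the paper may differ.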

8.6 · ML · Sep 15, 2023

Price of Safety in Linear Best Arm Identification

Xuedong Shang, Igor Colin, Merwan Barlier et al.

We introduce the safe best-arm identification framework with linear feedback, where the agent is subject to some stage-wise safety constraint that linearly depends on an unknown parameter vector. The agent must take actions in a conservative way so as to ensure that the safety constraint is not violated with high probability at each round. Ways of leveraging the linear structure for ensuring safety have been studied for regret minimization, but not for best-arm identification to the best of our knowledge. We propose a gap-based algorithm that achieves meaningful sample complexity while ensuring stage-wise safety. We show that we pay an extra term in the sample complexity due to the forced exploration phase incurred by the additional safety constraint. Experimental illustrations are provided to justify the design of our algorithm.

3.8 · LG · Sep 15, 2023

Adaptive Sample Sharing for Multi Agent Linear Bandits

Hamza Cherkaoui, Merwan Barlier, Igor Colin

The multi-agent linear bandit setting is a well-known setting for which designing efficient collaboration between agents remains challenging. This paper studies the impact of data sharing among agents on regret minimization. Unlike most existing approaches, our contribution does not rely on any assumptions on the structure of the bandit parameters. Our main result formalizes the trade-off between the bias and uncertainty of the bandit parameter estimation for efficient collaboration. This result is the cornerstone of the Bandit Adaptive Sample Sharing (BASS) algorithm, whose efficiency over the current state-of-the-art is validated through theoretical analysis and empirical evaluations on both synthetic and real-world datasets. Furthermore, we demonstrate that, when agents' parameters display a cluster structure, our algorithm accurately recovers them.

7.9 · LG · Feb 21, 2024 · Code

Enhancing Reinforcement Learning Agents with Local Guides

Paul Daoudi, Bogdan Robu, Christophe Prieur et al.

This paper addresses the problem of integrating local guide policies into a Reinforcement Learning agent. For this, we show how to adapt existing algorithms to this setting before introducing a novel algorithm based on a noisy policy-switching procedure. This approach builds on a proper Approximate Policy Evaluation (APE) scheme to provide a perturbation that carefully leads the local guides towards better actions. We evaluated our method on a set of classical Reinforcement Learning problems, including safety-critical systems where the agent cannot enter some areas at the risk of triggering catastrophic consequences. In all the proposed environments, our agent proved to be efficient at leveraging those policies to improve the performance of any APE-based Reinforcement Learning algorithm, especially in its first learning stages.

2.3 · SY · Feb 21, 2024

Improving a Proportional Integral Controller with Reinforcement Learning on a Throttle Valve Benchmark

Paul Daoudi, Bojan Mavkov, Bogdan Robu et al.

This paper presents a learning-based control strategy for non-linear throttle valves with an asymmetric hysteresis, leading to a near-optimal controller without requiring any prior knowledge about the environment. We start with a carefully tuned Proportional Integrator (PI) controller and exploit the recent advances in Reinforcement Learning (RL) with Guides to improve the closed-loop behavior by learning from the additional interactions with the valve. We test the proposed control method in various scenarios on three different valves, all highlighting the benefits of combining both PI and RL frameworks to improve control performance in non-linear stochastic systems. In all the experimental test cases, the resulting agent has a better sample efficiency than traditional RL agents and outperforms the PI controller.

9.4 · LG · Jan 31, 2025

Differentially Private Policy Gradient

Alexandre Rio, Merwan Barlier, Igor Colin

Motivated by the increasing deployment of reinforcement learning in the real world, involving a large consumption of personal data, we introduce a differentially private (DP) policy gradient algorithm. We show that, in this setting, the introduction of Differential Privacy can be reduced to the computation of appropriate trust regions, thus avoiding the sacrifice of theoretical properties of the DP-less methods. Therefore, we show that it is possible to find the right trade-off between privacy noise and trust-region size to obtain a performant differentially private policy gradient algorithm. We then outline its performance empirically on various benchmarks. Our results and the complexity of the tasks addressed represent a significant improvement over existing DP algorithms in online RL.

3.8 · LG · Dec 24, 2023 · Code

A Conservative Approach for Few-Shot Transfer in Off-Dynamics Reinforcement Learning

Paul Daoudi, Christophe Prieur, Bogdan Robu et al.

Off-dynamics Reinforcement Learning (ODRL) seeks to transfer a policy from a source environment to a target environment characterized by distinct yet similar dynamics. In this context, traditional RL agents depend excessively on the dynamics of the source environment, resulting in the discovery of policies that excel in this environment but fail to provide reasonable performance in the target one. In the few-shot framework, a limited number of transitions from the target environment are introduced to facilitate a more effective transfer. Addressing this challenge, we propose an innovative approach inspired by recent advancements in Imitation Learning and conservative RL algorithms. The proposed method introduces a penalty to regulate the trajectories generated by the source-trained policy. We evaluate our method across various environments representing diverse off-dynamics conditions, where access to the target environment is extremely limited. These experiments include high-dimensional systems relevant to real-world applications. Across most tested scenarios, our proposed method demonstrates performance improvements compared to existing baselines.

4.6 · LG · Feb 8, 2024

Differentially Private Deep Model-Based Reinforcement Learning

Alexandre Rio, Merwan Barlier, Igor Colin et al.

We address private deep offline reinforcement learning (RL), where the goal is to train a policy on standard control tasks that is differentially private (DP) with respect to individual trajectories in the dataset. To achieve this, we introduce PriMORL, a model-based RL algorithm with formal differential privacy guarantees. PriMORL first learns an ensemble of trajectory-level DP models of the environment from offline data. It then optimizes a policy on the penalized private model, without any further interaction with the system or access to the dataset. In addition to offering strong theoretical foundations, we demonstrate empirically that PriMORL enables the training of private RL agents on offline continuous control tasks with deep function approximations, whereas current methods are limited to simpler tabular and linear Markov Decision Processes (MDPs). We furthermore outline the trade-offs involved in achieving privacy in this setting.