LG · Oct 17, 2025

Safe, Efficient, and Robust Reinforcement Learning for Ranking and Diffusion Models

arXiv:2510.15429v17.11 citations h-index: 1

Originality: Incremental advance

AI Analysis

The dissertation addresses safety and efficiency challenges in real-world applications such as recommendation systems and generative AI. Its theoretical and algorithmic contributions are incremental.

This dissertation develops reinforcement learning methods for ranking and diffusion models that target safety, efficiency, and robustness. For ranking systems, it guarantees that the learned policy does not underperform the logging policy. For text-to-image generation, it reaches PPO-level sample efficiency while improving alignment with textual attributes.

This dissertation investigates how reinforcement learning (RL) methods can be designed to be safe, sample-efficient, and robust. Framed through the unifying perspective of contextual-bandit RL, the work addresses two major application domains: ranking and recommendation, and text-to-image diffusion models. The first part of the thesis develops theory and algorithms for safe deployment in ranking systems. An exposure-based generalisation bound is derived, leading to a counterfactual risk-minimisation objective whose solution is guaranteed not to underperform the logging policy, even with sparse feedback. This guarantee is extended to doubly robust estimators, enabling safety even under adversarial or misspecified user models and offering practitioners explicit control over permissible utility loss. The second part turns to single-action bandits, where various off-policy estimators are unified within a baseline-correction framework. A closed-form optimal baseline is proposed and shown to minimise both evaluation and policy-gradient variance, thereby improving off-policy learning reliability. The final part examines the trade-offs between efficiency and effectiveness in generative RL. A systematic study of PPO and REINFORCE motivates the Leave-One-Out PPO (LOOP) algorithm, which combines multiple diffusion trajectories with a REINFORCE-style baseline inside PPO's clipped objective. LOOP achieves PPO-level sample efficiency while producing generations that align more faithfully with textual attributes.
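The abstract describes LOOP only at a high level, so the following is a minimal sketch of the general idea rather than the dissertation's implementation. It assumes that several trajectories are sampled for the same prompt, that each trajectory is summarised by one scalar reward and one log-probability (a simplification; diffusion policies are normally scored per denoising step), and that the clip range is 0.2. The function names `leave_one_out_advantages` and `clipped_surrogate` are hypothetical.

```python
import math


def leave_one_out_advantages(rewards):
    """Advantage of each trajectory relative to the mean reward of the others.

    `rewards` holds one scalar reward per trajectory sampled for the SAME
    prompt (assumption: at least two trajectories per prompt).
    """
    k = len(rewards)
    if k < 2:
        raise ValueError("a leave-one-out baseline needs at least two trajectories")
    total = sum(rewards)
    # Baseline for trajectory i is the average reward of the other k - 1.
    return [r - (total - r) / (k - 1) for r in rewards]


def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO-style clipped objective (to be maximised), averaged over trajectories.

    Assumption: one log-probability per trajectory under the new and old policy.
    """
    terms = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # importance ratio pi_new / pi_old
        clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Pessimistic choice between the unclipped and clipped terms.
        terms.append(min(ratio * adv, clipped_ratio * adv))
    return sum(terms) / len(terms)


# Usage: three trajectories for one prompt, policy unchanged since sampling.
advantages = leave_one_out_advantages([1.0, 2.0, 3.0])
objective = clipped_surrogate([-1.0, -1.2, -0.8], [-1.0, -1.2, -0.8], advantages)
```

Because the baseline for each trajectory excludes that trajectory's own reward, it does not depend on the action being scored, which is what keeps a REINFORCE-style gradient estimate unbiased while reducing its variance.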

