Rongquan Bai

LG
h-index: 25
8 papers
74 citations
Novelty: 53%
AI Score: 45

8 Papers

LG · Oct 13, 2022
Sustainable Online Reinforcement Learning for Auto-bidding

Zhiyu Mou, Yusen Huo, Rongquan Bai et al.

Recently, auto-bidding techniques have become an essential tool for increasing advertisers' revenue. Facing the complex and ever-changing bidding environments of the real-world advertising system (RAS), state-of-the-art auto-bidding policies usually leverage reinforcement learning (RL) algorithms to generate real-time bids on behalf of advertisers. Due to safety concerns, it was believed that the RL training process could only be carried out in an offline virtual advertising system (VAS) built from historical data generated in the RAS. In this paper, we argue that there exist significant gaps between the VAS and the RAS, causing the RL training process to suffer from inconsistency between online and offline (IBOO). First, we formally define the IBOO and systematically analyze its causes and influences. Then, to avoid the IBOO, we propose a sustainable online RL (SORL) framework that trains the auto-bidding policy by interacting directly with the RAS instead of learning in the VAS. Specifically, based on our proof of the Lipschitz smooth property of the Q function, we design a safe and efficient online exploration (SER) policy for continuously collecting data from the RAS. We also derive a theoretical lower bound on the safety of the SER policy. We further develop a variance-suppressed conservative Q-learning (V-CQL) method to learn the auto-bidding policy effectively and stably from the collected data. Finally, extensive simulated and real-world experiments validate the superiority of our approach over the state-of-the-art auto-bidding algorithm.
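The link between a Lipschitz property of the Q function and safe exploration can be illustrated with a minimal sketch. It is not the paper's SER policy: it only assumes that Q is Lipschitz in a scalar action with a known constant, so that perturbing a trusted action by at most `max_value_drop / lipschitz_const` changes Q by at most `max_value_drop`. The function names and parameters are hypothetical.

```python
import random


def safe_exploration_radius(lipschitz_const: float, max_value_drop: float) -> float:
    """Largest action perturbation whose Q-value change stays within max_value_drop.

    Assumption (not from the source): |Q(s, a) - Q(s, a')| <= lipschitz_const * |a - a'|.
    """
    if lipschitz_const <= 0 or max_value_drop < 0:
        raise ValueError("need lipschitz_const > 0 and max_value_drop >= 0")
    return max_value_drop / lipschitz_const


def explore_action(base_action: float, lipschitz_const: float,
                   max_value_drop: float, rng: random.Random) -> float:
    """Sample an exploratory action uniformly inside the safe radius around base_action."""
    radius = safe_exploration_radius(lipschitz_const, max_value_drop)
    return base_action + rng.uniform(-radius, radius)


if __name__ == "__main__":
    rng = random.Random(0)
    # With L = 2.0 and a tolerated value drop of 0.5, actions stay within 0.25 of the base.
    print(explore_action(1.0, lipschitz_const=2.0, max_value_drop=0.5, rng=rng))
```

A smaller Lipschitz constant or a larger tolerated drop widens the region that can be explored, which is the trade-off between safety and exploration efficiency.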

GT · Mar 5, 2024
MEBS: Multi-task End-to-end Bid Shading for Multi-slot Display Advertising

Zhen Gong, Lvyin Niu, Yang Zhao et al.

Online bidding and auctions are crucial aspects of the online advertising industry. Conventionally, there is only one slot for ad display, and most current studies focus on this setting. Multi-slot display advertising, in which many ads are displayed in a list and shown to users as a whole, is gradually becoming popular. However, the slots in multi-slot display advertising differ in cost-effectiveness, so advertisers have an incentive to adjust bid prices to win the most economical ad positions. In this study, we introduce bid shading into multi-slot display advertising for bid price adjustment with a Multi-task End-to-end Bid Shading (MEBS) method. We prove the optimality of our method theoretically and examine its performance experimentally. Extensive offline and online experiments demonstrate the effectiveness and efficiency of our method, which obtains a 7.01% lift in Gross Merchandise Volume, a 7.42% lift in Return on Investment, and a 3.26% lift in ad buy count.
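As background, bid shading in its simplest single-slot form can be sketched as follows. This is not the MEBS method: it assumes a first-price setting, a known win-probability curve, and a grid search over candidate bids, all of which are illustrative assumptions. The names `shade_bid` and `uniform_win_prob` are hypothetical.

```python
from typing import Callable, Iterable, Tuple


def uniform_win_prob(bid: float) -> float:
    """Hypothetical win curve: the highest competing bid is uniform on [0, 10]."""
    return max(0.0, min(1.0, bid / 10.0))


def shade_bid(value: float, win_prob: Callable[[float], float],
              candidates: Iterable[float]) -> Tuple[float, float]:
    """Pick the candidate bid maximizing expected surplus (value - bid) * P(win | bid)."""
    best_bid, best_surplus = 0.0, 0.0
    for bid in candidates:
        surplus = (value - bid) * win_prob(bid)
        if surplus > best_surplus:
            best_bid, best_surplus = bid, surplus
    return best_bid, best_surplus


if __name__ == "__main__":
    grid = [i * 0.5 for i in range(21)]  # candidate bids 0.0, 0.5, ..., 10.0
    print(shade_bid(8.0, uniform_win_prob, grid))
```

Under this toy win curve, an ad worth 8 is bid at 4 rather than at its full value, since bidding the full value in a first-price auction leaves no surplus.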

GT · Oct 17, 2025
HOB: A Holistically Optimized Bidding Strategy under Heterogeneous Auction Mechanisms with Organic Traffic

Qi Li, Wendong Huang, Qichen Ye et al.

E-commerce advertising platforms typically sell commercial traffic through either second-price auctions (SPA) or first-price auctions (FPA). SPA was historically prevalent because it is dominant-strategy incentive-compatible (DSIC) for bidders with quasi-linear utilities, especially when budgets are not a binding constraint, while FPA has gained prominence for offering higher revenue potential to publishers and avoiding the possibility of discriminatory treatment through personalized reserve prices. Meanwhile, on the demand side, advertisers are increasingly adopting platform-wide marketing solutions akin to QuanZhanTui, shifting from spending budgets solely on commercial traffic to bidding on the entire traffic in order to maximize overall sales. For automated bidding systems, this trend poses a critical challenge: determining optimal strategies across heterogeneous auction channels to fulfill diverse advertiser objectives, such as maximizing return (MaxReturn) or meeting a target return on ad spend (TargetROAS). To overcome this challenge, this work makes two key contributions. First, we derive an efficient solution for optimal bidding under FPA channels that takes into account the presence of organic traffic, i.e., traffic that can be won for free. Second, we introduce a marginal cost alignment (MCA) strategy that provably secures bidding efficiency across heterogeneous auction mechanisms. To validate the performance of our framework, we conduct comprehensive offline experiments on public datasets and large-scale online A/B testing, which demonstrate consistent improvements over existing methods.

LG · Oct 9, 2025
A Unified Multi-Task Learning Framework for Generative Auto-Bidding with Validation-Aligned Optimization

Yiqin Lv, Zhiyu Mou, Miao Xu et al. · Tsinghua

In online advertising, heterogeneous advertiser requirements give rise to numerous customized bidding tasks that are typically optimized independently, resulting in extensive computation and limited data efficiency. Multi-task learning offers a principled framework to train these tasks jointly through shared representations. However, existing multi-task optimization strategies are primarily guided by training dynamics and often generalize poorly in volatile bidding environments. To this end, we present Validation-Aligned Multi-task Optimization (VAMO), which adaptively assigns task weights based on the alignment between per-task training gradients and a held-out validation gradient, thereby steering updates toward validation improvement and better matching deployment objectives. We further equip the framework with a periodicity-aware temporal module and couple it with an advanced generative auto-bidding backbone to enhance cross-task transfer of seasonal structure and strengthen bidding performance. Meanwhile, we provide theoretical insights into the proposed method, e.g., convergence guarantee and alignment analysis. Extensive experiments on both simulated and large-scale real-world advertising systems consistently demonstrate significant improvements over typical baselines, illuminating the effectiveness of the proposed approach.
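The idea of weighting tasks by how well their training gradients align with a held-out validation gradient can be sketched as below. The abstract does not give VAMO's exact weighting rule, so this sketch assumes cosine similarity as the alignment score and a softmax as the normalization; both choices, and the name `alignment_weights`, are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def alignment_weights(task_grads: Sequence[Sequence[float]],
                      val_grad: Sequence[float],
                      temperature: float = 1.0) -> List[float]:
    """Softmax over per-task alignment scores (assumed rule, not the paper's)."""
    scores = [cosine(g, val_grad) / temperature for g in task_grads]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    grads = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    print(alignment_weights(grads, val_grad=[1.0, 0.0]))
```

A task whose gradient points with the validation gradient receives the largest weight, an orthogonal task a smaller one, and a conflicting task the smallest.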

LG · Sep 19, 2025
Enhancing Generative Auto-bidding with Offline Reward Evaluation and Policy Search

Zhiyu Mou, Yiqin Lv, Miao Xu et al.

Auto-bidding serves as a critical tool for advertisers to improve their advertising performance. Recent progress has demonstrated that AI-Generated Bidding (AIGB), which learns a conditional generative planner from offline data, achieves superior performance compared to typical offline reinforcement learning (RL)-based auto-bidding methods. However, existing AIGB methods still face a performance bottleneck due to their inherent inability to explore beyond the static offline dataset. To address this, we propose AIGB-Pearl (Planning with EvaluAtor via RL), a novel method that integrates generative planning and policy optimization. The core of AIGB-Pearl lies in constructing a trajectory evaluator for scoring generation quality and designing a provably sound KL-Lipschitz-constrained score maximization scheme to ensure safe and efficient exploration beyond the offline dataset. A practical algorithm incorporating the synchronous coupling technique is further devised to ensure the model regularity required by the proposed scheme. Extensive experiments on both simulated and real-world advertising systems demonstrate the state-of-the-art performance of our approach.

LG · Jun 22, 2025
Permutation Equivariant Model-based Offline Reinforcement Learning for Auto-bidding

Zhiyu Mou, Miao Xu, Wei Chen et al.

Reinforcement learning (RL) for auto-bidding has shifted from using simplistic offline simulators (Simulation-based RL Bidding, SRLB) to offline RL on fixed real datasets (Offline RL Bidding, ORLB). However, ORLB policies are limited by the dataset's state space coverage, offering modest gains. While SRLB expands state coverage, its simulator-reality gap risks misleading policies. This paper introduces Model-based RL Bidding (MRLB), which learns an environment model from real data to bridge this gap. MRLB trains policies using both real and model-generated data, expanding state coverage beyond ORLB. To ensure model reliability, we propose: 1) A permutation equivariant model architecture for better generalization, and 2) A robust offline Q-learning method that pessimistically penalizes model errors. These form the Permutation Equivariant Model-based Offline RL (PE-MORL) algorithm. Real-world experiments show that PE-MORL outperforms state-of-the-art auto-bidding methods.
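Permutation equivariance means that reordering the inputs reorders the outputs in the same way. The sketch below shows a generic equivariant layer that combines each element with a mean-pooled summary of all elements. It is not the PE-MORL architecture: the scalar inputs, the weights, and the name `equivariant_layer` are illustrative assumptions.

```python
from typing import List, Sequence


def equivariant_layer(xs: Sequence[float], self_weight: float = 1.0,
                      pool_weight: float = 0.5) -> List[float]:
    """Map each element to self_weight * x_i + pool_weight * mean(xs).

    The mean does not depend on element order, so permuting xs permutes the output.
    """
    pooled = sum(xs) / len(xs)
    return [self_weight * x + pool_weight * pooled for x in xs]


if __name__ == "__main__":
    xs = [1.0, 4.0, 7.0]
    perm = [2, 0, 1]
    out = equivariant_layer(xs)
    out_permuted = equivariant_layer([xs[i] for i in perm])
    # Equivariance: applying the layer after permuting equals permuting after the layer.
    print(out_permuted == [out[i] for i in perm])
```

Building this symmetry into the model means it does not have to learn it from data, which is one common motivation for equivariant architectures.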

LG · Mar 13, 2025
Nash Equilibrium Constrained Auto-bidding With Bi-level Reinforcement Learning

Zhiyu Mou, Miao Xu, Rongquan Bai et al.

Many online advertising platforms provide advertisers with auto-bidding services to enhance their advertising performance. However, most existing auto-bidding algorithms fail to accurately capture the auto-bidding problem formulation that the platform truly faces, let alone solve it. Actually, we argue that the platform should try to help optimize each advertiser's performance to the greatest extent, which makes ε-Nash Equilibrium (ε-NE) a necessary solution concept, while maximizing the social welfare of all the advertisers for the platform's long-term value. Based on this, we introduce Nash-Equilibrium Constrained Bidding (NCB), a new formulation of the auto-bidding problem from the platform's perspective. Specifically, it aims to maximize the social welfare of all advertisers under the ε-NE constraint. However, the NCB problem presents significant challenges due to its constrained bi-level structure and the typically large number of advertisers involved. To address these challenges, we propose a Bi-level Policy Gradient (BPG) framework with theoretical guarantees. Notably, its computational complexity is independent of the number of advertisers, and the associated gradients are straightforward to compute. Extensive simulated and real-world experiments validate the effectiveness of the BPG framework.
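For readers unfamiliar with the solution concept, a strategy profile is an ε-Nash equilibrium when no player can gain more than ε by deviating unilaterally. The sketch below checks this for a two-player game with finitely many pure actions. It illustrates the definition only, not the NCB formulation or the BPG framework; the payoff layout and the name `is_epsilon_ne` are assumptions.

```python
from typing import List, Tuple

# payoffs[p][a0][a1] is the utility of player p when player 0 plays a0 and player 1 plays a1.
Payoffs = List[List[List[float]]]


def is_epsilon_ne(payoffs: Payoffs, profile: Tuple[int, int], epsilon: float) -> bool:
    """Return True if neither player gains more than epsilon by a unilateral deviation."""
    a0, a1 = profile
    current0 = payoffs[0][a0][a1]
    current1 = payoffs[1][a0][a1]
    # Best payoff each player could reach while the opponent's action stays fixed.
    best0 = max(payoffs[0][d][a1] for d in range(len(payoffs[0])))
    best1 = max(payoffs[1][a0][d] for d in range(len(payoffs[1][a0])))
    return best0 - current0 <= epsilon and best1 - current1 <= epsilon


if __name__ == "__main__":
    # Prisoner's dilemma: action 0 = cooperate, action 1 = defect.
    game = [
        [[3.0, 0.0], [5.0, 1.0]],  # player 0
        [[3.0, 5.0], [0.0, 1.0]],  # player 1
    ]
    print(is_epsilon_ne(game, (1, 1), epsilon=0.0))
    print(is_epsilon_ne(game, (0, 0), epsilon=1.0))
```

In this toy game, mutual defection is an exact equilibrium, while mutual cooperation only qualifies once ε is at least 2, the gain each player gets from defecting.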

LG · Jun 29, 2020
Dynamic Knapsack Optimization Towards Efficient Multi-Channel Sequential Advertising

Xiaotian Hao, Zhaoqing Peng, Yi Ma et al.

In E-commerce, advertising is essential for merchants to reach their target users. The typical objective is to maximize the advertiser's cumulative revenue over a period of time under a budget constraint. In real applications, an advertisement (ad) usually needs to be exposed to the same user multiple times before the user finally contributes revenue (e.g., places an order). However, existing advertising systems mainly focus on the immediate revenue of single ad exposures and ignore the contribution of each exposure to the final conversion, and thus usually fall into suboptimal solutions. In this paper, we formulate sequential advertising strategy optimization as a dynamic knapsack problem. We propose a theoretically guaranteed bilevel optimization framework, which significantly reduces the solution space of the original optimization problem while ensuring solution quality. To improve the exploration efficiency of reinforcement learning, we also devise an effective action space reduction approach. Extensive offline and online experiments show the superior performance of our approaches over state-of-the-art baselines in terms of cumulative revenue.
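The budget-constrained revenue objective has the shape of a knapsack problem. As background, the classical static 0/1 knapsack can be solved by dynamic programming, as sketched below. This is not the paper's dynamic knapsack formulation or its bilevel framework: it assumes a fixed set of items with known non-negative integer costs and known values, and the name `knapsack_max_value` is hypothetical.

```python
from typing import Sequence


def knapsack_max_value(costs: Sequence[int], values: Sequence[float], budget: int) -> float:
    """Maximum total value of a subset of items whose total cost fits within budget.

    Costs must be non-negative integers; each item can be chosen at most once.
    """
    best = [0.0] * (budget + 1)  # best[b] = best value achievable with budget b
    for cost, value in zip(costs, values):
        # Iterate budgets downward so each item is used at most once.
        for b in range(budget, cost - 1, -1):
            best[b] = max(best[b], best[b - cost] + value)
    return best[budget]


if __name__ == "__main__":
    # Items cost 2, 3, 4 and are worth 3, 4, 5; with budget 5 the first two are chosen.
    print(knapsack_max_value([2, 3, 4], [3.0, 4.0, 5.0], budget=5))
```

In the sequential advertising setting, the value of an exposure depends on earlier exposures to the same user, which is what makes the problem harder than this static version.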