LGAIIRJun 6, 2022

Pessimistic Off-Policy Optimization for Learning to Rank

arXiv:2206.02593v43 citationsh-index: 37
Originality Incremental advance
AI Analysis

This work addresses the problem of robust and efficient ranking optimization for recommender systems, representing an incremental improvement over existing off-policy methods.

The paper tackles the challenge of off-policy optimization in recommender systems, where imbalanced logged data and combinatorial action spaces make learning difficult, by proposing a pessimistic approach that uses lower confidence bounds on click model parameters to select lists, resulting in outperformance over baselines in empirical comparisons.

Off-policy learning is a framework for optimizing policies without deploying them, using data collected by another policy. In recommender systems, this is especially challenging due to the imbalance in logged data: some items are recommended and thus logged more frequently than others. This is further perpetuated when recommending a list of items, as the action space is combinatorial. To address this challenge, we study pessimistic off-policy optimization for learning to rank. The key idea is to compute lower confidence bounds on parameters of click models and then return the list with the highest pessimistic estimate of its value. This approach is computationally efficient, and we analyze it. We study its Bayesian and frequentist variants and overcome the limitation of unknown prior by incorporating empirical Bayes. To show the empirical effectiveness of our approach, we compare it to off-policy optimizers that use inverse propensity scores or neglect uncertainty. Our approach outperforms all baselines and is both robust and general.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes