LG | Jan 17, 2024

Cascading Reinforcement Learning

arXiv:2401.08961v44.62 citations | h-index: 3 | ICLR

Originality: Incremental advance

AI Analysis

This work addresses a gap in cascading bandits for recommendation systems by modeling dynamic user states, though it appears incremental as it extends prior models with state considerations.

The paper incorporates user states and state transitions into cascading bandit models for recommendation systems. It proposes a cascading RL framework and, to handle the combinatorial action space, an oracle (BestPerm) that efficiently finds the optimal item list. The resulting algorithms, CascadingVI and CascadingBPI, show improved computational and sample efficiency in experiments compared with straightforward adaptations of existing RL algorithms.

Cascading bandits have gained popularity in recent years due to their applicability to recommendation systems and online advertising. In the cascading bandit model, at each timestep, an agent recommends an ordered subset of items (called an item list) from a pool of items, each associated with an unknown attraction probability. The user then examines the list and clicks the first attractive item (if any), after which the agent receives a reward. The goal of the agent is to maximize the expected cumulative reward. However, the prior literature on cascading bandits ignores the influence of user states (e.g., historical behaviors) on recommendations and the change of states as the session proceeds. Motivated by this fact, we propose a generalized cascading RL framework, which incorporates the impact of user states and state transitions into decisions. In cascading RL, we need to select items that not only have large attraction probabilities but also lead to good successor states. This imposes a huge computational challenge due to the combinatorial action space. To tackle this challenge, we delve into the properties of value functions and design an oracle, BestPerm, to efficiently find the optimal item list. Equipped with BestPerm, we develop two algorithms, CascadingVI and CascadingBPI, which are both computationally efficient and sample-efficient, and provide near-optimal regret and sample complexity guarantees. Furthermore, we present experiments showing the improved computational and sample efficiency of our algorithms in practice compared to straightforward adaptations of existing RL algorithms.
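The click model and the size of the action space described in the abstract can be illustrated with a small sketch. This is not the paper's code: the function names `click_probabilities` and `num_item_lists`, the example attraction probabilities, and the assumption that item lists contain between 1 and `max_len` distinct items are all hypothetical choices made for illustration.

```python
from math import perm


def click_probabilities(attraction):
    """Cascade click model: the user scans the list top-down and clicks
    the first attractive item.

    `attraction` holds the (here assumed known) attraction probability of
    each listed item, in display order. Returns the probability of a click
    at each position and the probability that nothing is clicked.
    """
    probs = []
    skip = 1.0  # probability that every earlier item was skipped
    for q in attraction:
        probs.append(skip * q)  # reached this item and found it attractive
        skip *= 1.0 - q         # reached this item and skipped it
    return probs, skip


def num_item_lists(n_items, max_len):
    """Count ordered lists of 1..max_len distinct items drawn from n_items.

    Assumption: lists are non-empty and bounded by max_len.
    """
    return sum(perm(n_items, k) for k in range(1, max_len + 1))


if __name__ == "__main__":
    probs, no_click = click_probabilities([0.5, 0.2, 0.4])
    print("click probability per position:", probs)
    print("no-click probability:", no_click)
    print("item lists for 10 items, length <= 3:", num_item_lists(10, 3))
```

The count grows quickly with the pool size and list length, which is why enumerating every item list per state is impractical and an oracle such as BestPerm is useful.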
