LGMay 26, 2022

Exploration, Exploitation, and Engagement in Multi-Armed Bandits with Abandonment

arXiv:2205.13566v16.910 citationsh-index: 47

Originality Incremental advance

AI Analysis

This work addresses user engagement challenges in modern recommendation systems, offering a novel algorithmic approach that is incremental but tailored to dynamic user behavior.

The paper tackles the problem of user abandonment in multi-armed bandit systems, such as in online education or video recommendations, by proposing a new MAB-A model where abandonment probability depends on user state, and introduces ULCB and KL-ULCB algorithms that adjust exploration based on user feedback, achieving logarithmic regret with significant improvements over traditional methods in simulations.

Multi-armed bandit (MAB) is a classic model for understanding the exploration-exploitation trade-off. The traditional MAB model for recommendation systems assumes the user stays in the system for the entire learning horizon. In new online education platforms such as ALEKS or new video recommendation systems such as TikTok and YouTube Shorts, the amount of time a user spends on the app depends on how engaging the recommended contents are. Users may temporarily leave the system if the recommended items cannot engage the users. To understand the exploration, exploitation, and engagement in these systems, we propose a new model, called MAB-A where "A" stands for abandonment and the abandonment probability depends on the current recommended item and the user's past experience (called state). We propose two algorithms, ULCB and KL-ULCB, both of which do more exploration (being optimistic) when the user likes the previous recommended item and less exploration (being pessimistic) when the user does not like the previous item. We prove that both ULCB and KL-ULCB achieve logarithmic regret, $O(\log K)$, where $K$ is the number of visits (or episodes). Furthermore, the regret bound under KL-ULCB is asymptotically sharp. We also extend the proposed algorithms to the general-state setting. Simulation results confirm our theoretical analysis and show that the proposed algorithms have significantly lower regrets than the traditional UCB and KL-UCB, and Q-learning-based algorithms.

View on arXiv PDF

Similar