LGAIMay 26

Linear and Neural Dueling Bandits with Delayed Feedback

arXiv:2605.2655446.8
AI Analysis

It addresses the practical problem of delayed feedback in preference-based decision-making for recommender systems and LLM alignment, where existing dueling bandit algorithms fail.

This paper formalizes contextual dueling bandits with stochastic delayed feedback and proposes two algorithms (LDB-DF and NDB-DF) that achieve O(d√T) regret for linear and sub-linear regret for neural settings, validated on simulated and real-world data.

Contextual dueling bandits form a cornerstone of preference-based decision-making, with critical applications in recommender systems and large language model alignment. However, standard algorithms rely on the idealized assumption of immediate feedback, a condition frequently violated in real-world scenarios such as prompt optimization. This setting introduces a unique theoretical challenge: unlike linear bandits, dueling bandit estimators lack closed-form solutions, rendering naive adaptations of standard weighting techniques biased. To address this, we formalize the problem of Contextual Dueling Bandits with Stochastic Delayed Feedback and propose two novel algorithms: Linear (LDB-DF) and Neural (NDB-DF) Dueling Bandits with Delayed Feedback. Central to our approach is a novel estimator that integrates an Inverse Probability Weighting (IPW) mechanism directly into the loss function, ensuring unbiased correction for delayed or missing feedback. We provide comprehensive theoretical analysis, establishing an O(d*sqrt(T)) regret bound for the linear setting and sub-linear guarantees for the neural setting. Extensive experiments on both simulated and real-world datasets demonstrate the effectiveness of our propose.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes