LG · Sep 3, 2025

Offline Contextual Bandit with Counterfactual Sample Identification

arXiv:2509.10520v1 · h-index: 6
Originality: Highly original
AI Analysis

This paper addresses confounding in production contextual bandit systems and proposes a new approach for evaluating actions more accurately.

The paper tackles confounding in contextual bandit reward models by introducing Counterfactual Sample Identification, which learns to identify successful actions through counterfactual comparisons. The authors report consistent improvements over direct models in both synthetic experiments and real-world deployments.

In production systems, contextual bandit approaches often rely on direct reward models that take both action and context as input. However, these models can suffer from confounding, making it difficult to isolate the effect of the action from that of the context. We present Counterfactual Sample Identification, a new approach that re-frames the problem: rather than predicting reward, it learns to recognize which action led to a successful (binary) outcome by comparing it to a counterfactual action sampled from the logging policy under the same context. The method is theoretically grounded and consistently outperforms direct models in both synthetic experiments and real-world deployments.
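To make the re-framing concrete, here is a minimal NumPy sketch of one way the idea could be set up. It is not the authors' implementation: the abstract does not specify the model or the loss, so the pairwise logistic loss, the linear per-action scorer, the synthetic data, and all names (`sample_actions`, `build_pairs`, `fit_identifier`) are assumptions made for illustration. Each successful logged action is paired with a counterfactual action drawn from the logging policy under the same context, and a scorer is trained to tell which of the two is the real one.

```python
import numpy as np

def sample_actions(probs, rng):
    """Draw one action per row from a row-stochastic probability matrix."""
    cum = probs.cumsum(axis=1)
    u = rng.random((probs.shape[0], 1))
    # Count the cumulative bins that u has passed; the clip guards against round-off.
    return np.minimum((u > cum).sum(axis=1), probs.shape[1] - 1)

def build_pairs(contexts, actions, rewards, logging_probs, rng):
    """Keep successful rows and attach a counterfactual action from the
    logging policy, sampled under the same context."""
    keep = rewards == 1
    ctx, real = contexts[keep], actions[keep]
    fake = sample_actions(logging_probs[keep], rng)
    return ctx, real, fake

def fit_identifier(ctx, real, fake, n_actions, lr=0.5, epochs=300):
    """Assumed model: linear score per action, trained by gradient descent on
    the pairwise logistic loss -log sigmoid(score(real) - score(fake))."""
    n, d = ctx.shape
    W = np.zeros((n_actions, d))
    rows = np.arange(n)
    for _ in range(epochs):
        scores = ctx @ W.T
        margin = scores[rows, real] - scores[rows, fake]
        g = -1.0 / (1.0 + np.exp(margin))  # derivative of the loss w.r.t. the margin
        grad = np.zeros_like(W)
        np.add.at(grad, real, g[:, None] * ctx)
        np.add.at(grad, fake, -g[:, None] * ctx)
        W -= lr * grad / n
    return W

# Hypothetical synthetic log: the logging policy depends on the context and
# favours action 0, while action 2 has the highest success rate.
rng = np.random.default_rng(0)
n, n_actions = 20000, 3
x1 = rng.normal(size=n)
contexts = np.column_stack([np.ones(n), x1])
logits = np.column_stack([1.0 + 0.5 * x1, np.full(n, 0.3), -0.8 - 0.5 * x1])
logging_probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
actions = sample_actions(logging_probs, rng)
success_rate = np.array([0.05, 0.2, 0.8])
rewards = (rng.random(n) < success_rate[actions]).astype(int)

ctx, real, fake = build_pairs(contexts, actions, rewards, logging_probs, rng)
W = fit_identifier(ctx, real, fake, n_actions)
best_action = int(np.argmax(W @ np.array([1.0, 0.0])))
print("pairs:", len(ctx), "best action at x1 = 0:", best_action)
```

In this toy setup the scorer only ever sees successful rows, yet it can still rank actions, because the counterfactual sample carries the information about how often the logging policy would have chosen each action in that context. A production system would presumably replace the linear scorer with its existing model architecture.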

