Group-Sensitive Offline Contextual Bandits
This work tackles fairness concerns in offline contextual bandits for resource-limited applications. It is an incremental improvement that adds group-sensitive fairness constraints to offline policy optimization.
The paper addresses the problem of reward disparities across groups in offline contextual bandits, where optimizing for overall expected rewards can unfairly benefit some groups over others. The authors propose a constrained policy optimization framework with group-sensitive fairness constraints, demonstrating through experiments that it effectively reduces reward disparities while maintaining competitive overall performance.
Offline contextual bandits allow one to learn policies from historical (offline) data without requiring online interaction. However, offline policy optimization that maximizes overall expected reward can unintentionally amplify reward disparities across groups. As a result, some groups might benefit more than others from the learned policy, raising concerns about fairness, especially when resources are limited. In this paper, we study a group-sensitive fairness constraint in offline contextual bandits that reduces the group-wise reward disparities that may arise during policy learning. We tackle two common parity requirements: either the reward disparity is constrained to lie within a user-defined threshold, or the reward disparity is minimized during policy optimization. We propose a constrained offline policy optimization framework that introduces group-wise reward disparity constraints into an off-policy gradient-based optimization procedure. To improve the estimation of the group-wise reward disparity during training, we employ a doubly robust estimator, and we further provide a convergence guarantee for policy optimization. Empirical results on synthetic and real-world datasets demonstrate that our method effectively reduces reward disparities while maintaining competitive overall performance.
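To make the estimation step concrete, the following is a minimal sketch of how a group-wise reward disparity could be computed from logged data with a standard doubly robust estimator. It is an illustration under stated assumptions, not the authors' implementation: it assumes two groups coded 0 and 1, known logging propensities, and a pre-fitted reward model `q_hat`. All function names are hypothetical, and the hinge penalty in `penalized_objective` is only one simple way to express the threshold requirement; the paper's actual constrained formulation may differ.

```python
import numpy as np

def dr_terms(pi_probs, actions, rewards, behavior_probs, q_hat):
    """Per-sample doubly robust terms for a target policy.

    pi_probs:       (n, K) target-policy action probabilities
    actions:        (n,)   logged actions (integers in [0, K))
    rewards:        (n,)   logged rewards
    behavior_probs: (n,)   logging-policy probability of each logged action
    q_hat:          (n, K) reward-model predictions (assumed pre-fitted)
    """
    idx = np.arange(len(actions))
    # Direct-method part: expected model reward under the target policy.
    direct = (pi_probs * q_hat).sum(axis=1)
    # Importance-weighted correction of the model's residual on logged actions.
    weights = pi_probs[idx, actions] / behavior_probs
    correction = weights * (rewards - q_hat[idx, actions])
    return direct + correction

def group_disparity(terms, groups):
    """Absolute gap between the estimated values of group 0 and group 1."""
    v0 = terms[groups == 0].mean()
    v1 = terms[groups == 1].mean()
    return abs(v0 - v1)

def penalized_objective(terms, groups, lam, eps):
    """Overall estimated value minus a hinge penalty on disparity above eps.

    lam (penalty weight) and eps (disparity threshold) are illustrative
    hyperparameters, not values taken from the paper.
    """
    gap = group_disparity(terms, groups)
    return terms.mean() - lam * max(0.0, gap - eps)
```

Setting `eps` to the user-defined threshold corresponds to the constrained requirement, while `eps = 0` with a large `lam` pushes the optimizer toward minimizing the disparity. In a gradient-based procedure, `pi_probs` would come from a parameterized policy and this objective would be maximized with an automatic-differentiation library.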