Choice-Model-Assisted Q-learning for Delayed-Feedback Revenue Management
This work addresses revenue management problems with delayed feedback, as arise in hotel bookings, and improves robustness under parameter shifts, though the contribution is incremental in that it combines existing methods.
The paper tackles reinforcement learning for revenue management with delayed feedback by proposing choice-model-assisted RL, which uses a calibrated discrete choice model to impute the delayed component of the learning target at decision time. Experiments in a hotel booking simulator show no statistically detectable difference from a maturity-buffer DQN baseline in stationary settings, significant gains in 5 of 10 in-family parameter-shift scenarios (up to 12.4%), and 1.4-2.6% lower revenue under structural misspecification.
We study reinforcement learning for revenue management with delayed feedback, where a substantial fraction of value is determined by customer cancellations and modifications observed days after booking. We propose \emph{choice-model-assisted RL}: a calibrated discrete choice model is used as a fixed partial world model to impute the delayed component of the learning target at decision time. In the fixed-model deployment regime, we prove that tabular Q-learning with model-imputed targets converges to an $O(\varepsilon/(1-\gamma))$ neighborhood of the optimal Q-function, where $\varepsilon$ summarizes partial-model error, with an additional $O(t^{-1/2})$ sampling term. Experiments in a simulator calibrated from 61{,}619 hotel bookings (1{,}088 independent runs) show: (i) no statistically detectable difference from a maturity-buffer DQN baseline in stationary settings; (ii) positive effects under in-family parameter shifts, with significant gains in 5 of 10 shift scenarios after Holm--Bonferroni correction (up to 12.4\%); and (iii) consistent degradation under structural misspecification, where the choice model assumptions are violated (1.4--2.6\% lower revenue). These results characterize when partial behavioral models improve robustness under shift and when they introduce harmful bias.
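To make the idea of a model-imputed target concrete, the following is a minimal sketch of tabular Q-learning in which the delayed cancellation outcome is replaced at decision time by its expectation under a fixed choice model. It is an illustration under stated assumptions, not the authors' implementation: the binary-logit cancellation model and its coefficients `beta0` and `beta1`, the price grid `PRICES`, the three-period `HORIZON`, and the toy demand curve are all hypothetical and are not taken from the paper.

```python
import math
import random

PRICES = (80.0, 100.0, 120.0)  # hypothetical price actions
HORIZON = 3                    # hypothetical number of decision periods


def cancel_probability(price, beta0=-1.0, beta1=0.01):
    """Hypothetical fixed binary-logit model for P(cancel | price)."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * price)))


def imputed_reward(price, booked):
    """Reward with the delayed component imputed by the choice model.

    Instead of waiting days to observe whether the booking is cancelled,
    the expected retained revenue is used at decision time.
    """
    if not booked:
        return 0.0
    return price * (1.0 - cancel_probability(price))


def q_update(Q, s, a, r_hat, s_next, alpha=0.1, gamma=0.95):
    """One tabular Q-learning step using the imputed reward r_hat."""
    done = s_next >= HORIZON
    best_next = 0.0 if done else max(Q.get((s_next, b), 0.0) for b in PRICES)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (r_hat + gamma * best_next - old)
    return Q[(s, a)]


def train(episodes=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy training loop on a toy booking process."""
    rng = random.Random(seed)
    Q = {}
    for _ in range(episodes):
        for s in range(HORIZON):
            if rng.random() < epsilon:
                a = rng.choice(PRICES)
            else:
                a = max(PRICES, key=lambda p: Q.get((s, p), 0.0))
            # Toy demand curve (assumption): higher prices book less often.
            booked = rng.random() < math.exp(-a / 100.0)
            q_update(Q, s, a, imputed_reward(a, booked), s + 1)
    return Q
```

Calling `train()` returns a dictionary keyed by `(period, price)`. Because the choice model is held fixed, any error in `cancel_probability` enters every target as a bias, which is the kind of partial-model error that $\varepsilon$ summarizes in the convergence bound.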