Meta Dynamic Pricing: Transfer Learning Across Experiments
This work addresses efficient dynamic pricing in experiment-rich environments, with applications such as retail and finance. The contribution is incremental in that it builds on Thompson sampling, adding a novel prior alignment technique to analyze regret under a mis-specified prior.
The paper tackles the problem of learning shared structure across dynamic pricing experiments for related products. It proposes a meta dynamic pricing algorithm that learns an unknown prior distribution online while conducting a sequence of Thompson sampling experiments. The main result is that the algorithm's meta regret grows sublinearly in the number of products $N$. Experiments on synthetic and real auto loan data show that the algorithm learns significantly faster than prior-independent methods.
We study the problem of learning shared structure \emph{across} a sequence of dynamic pricing experiments for related products. We consider a practical formulation where the unknown demand parameters for each product come from an unknown distribution (prior) that is shared across products. We then propose a meta dynamic pricing algorithm that learns this prior online while solving a sequence of Thompson sampling pricing experiments (each with horizon $T$) for $N$ different products. Our algorithm addresses two challenges: (i) balancing the need to learn the prior (\emph{meta-exploration}) with the need to leverage the estimated prior to achieve good performance (\emph{meta-exploitation}), and (ii) accounting for uncertainty in the estimated prior by appropriately "widening" the estimated prior as a function of its estimation error. We introduce a novel prior alignment technique to analyze the regret of Thompson sampling with a mis-specified prior, which may be of independent interest. Unlike prior-independent approaches, our algorithm's meta regret grows sublinearly in $N$, demonstrating that the price of an unknown prior in Thompson sampling can be negligible in experiment-rich environments (large $N$). Numerical experiments on synthetic and real auto loan data demonstrate that our algorithm significantly speeds up learning compared to prior-independent algorithms.
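The interplay between meta-exploration, meta-exploitation, and prior widening described above can be illustrated with a minimal Python sketch. This is not the paper's algorithm; it is a simplified illustration under stated assumptions: demand is linear, $d = a - b\,p + \text{noise}$, with a known slope $b$ and Gaussian noise, so that only the intercept $a$ is unknown and drawn from a shared Gaussian prior. The function names (`thompson_pricing`, `meta_pricing`), the diffuse prior used during meta-exploration, the number of exploration products `n_explore`, and the widening term `widen / sqrt(i)` are all hypothetical choices made for illustration.

```python
import random
import statistics


def thompson_pricing(a_true, slope, prior_mean, prior_var, T, rng, noise_sd=1.0):
    """Run one pricing experiment with demand d = a - slope * p + noise.

    Only the intercept a is unknown (a simplifying assumption), so the
    Gaussian posterior over a has a closed form.
    Returns (total revenue, sample-average estimate of a).
    """
    post_mean, post_var = prior_mean, prior_var
    revenue, obs_sum = 0.0, 0.0
    for _ in range(T):
        # Thompson sampling: draw a from the current posterior.
        a_sample = rng.gauss(post_mean, post_var ** 0.5)
        # Price maximizing p * (a_sample - slope * p), floored at zero.
        price = max(a_sample, 0.0) / (2.0 * slope)
        demand = a_true - slope * price + rng.gauss(0.0, noise_sd)
        revenue += price * demand
        # With a known slope, demand + slope * price is a noisy observation of a.
        y = demand + slope * price
        obs_sum += y
        # Conjugate normal-normal update with known noise variance.
        new_var = 1.0 / (1.0 / post_var + 1.0 / noise_sd ** 2)
        post_mean = new_var * (post_mean / post_var + y / noise_sd ** 2)
        post_var = new_var
    return revenue, obs_sum / T


def meta_pricing(N, T, true_mean=10.0, true_sd=1.0, slope=2.0,
                 n_explore=5, widen=1.0, seed=0):
    """Run N experiments in sequence, learning the shared prior online.

    Requires n_explore >= 2 and N >= 2 so that sample variances are defined.
    Returns (per-product revenues, estimated prior mean, estimated prior variance).
    """
    rng = random.Random(seed)
    estimates, revenues = [], []
    for i in range(N):
        a_true = rng.gauss(true_mean, true_sd)  # product drawn from the shared prior
        if i < n_explore:
            # Meta-exploration: hypothetical diffuse, prior-independent start.
            prior_mean, prior_var = 0.0, 100.0
        else:
            # Meta-exploitation: use the estimated prior, widened by a term
            # that shrinks as more products are observed (illustrative choice).
            prior_mean = statistics.fmean(estimates)
            prior_var = statistics.variance(estimates) + widen / i ** 0.5
        rev, a_hat = thompson_pricing(a_true, slope, prior_mean, prior_var, T, rng)
        revenues.append(rev)
        estimates.append(a_hat)
    return revenues, statistics.fmean(estimates), statistics.variance(estimates)


if __name__ == "__main__":
    revs, est_mean, est_var = meta_pricing(N=200, T=20)
    print(f"estimated prior mean: {est_mean:.2f}, variance: {est_var:.2f}")
```

In this sketch the per-product estimates include sampling noise, so the estimated prior variance is slightly inflated relative to the true one. The widening term plays the role the abstract describes: it guards against over-trusting a prior estimated from few products, and it vanishes as the number of products grows.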