Fully Gap-Dependent Bounds for Multinomial Logit Bandit
This work provides improved theoretical guarantees for assortment optimization in online retail, though it is incremental as it builds on prior bandit methods.
The paper tackles the multinomial logit bandit problem by developing algorithms that achieve gap-dependent bounds for identifying the optimal assortment and minimizing regret, with time steps scaling as Õ(∑Δ_i^{-2}) and regret as O(∑KΔ_i^{-1} log T).
We study the multinomial logit (MNL) bandit problem, where at each time step, the seller offers an assortment of size at most $K$ from a pool of $N$ items, and the buyer purchases an item from the assortment according to a MNL choice model. The objective is to learn the model parameters and maximize the expected revenue. We present (i) an algorithm that identifies the optimal assortment $S^*$ within $\widetilde{O}(\sum_{i = 1}^N Δ_i^{-2})$ time steps with high probability, and (ii) an algorithm that incurs $O(\sum_{i \notin S^*} KΔ_i^{-1} \log T)$ regret in $T$ time steps. To our knowledge, our algorithms are the first to achieve gap-dependent bounds that fully depends on the suboptimality gaps of all items. Our technical contributions include an algorithmic framework that relates the MNL-bandit problem to a variant of the top-$K$ arm identification problem in multi-armed bandits, a generalized epoch-based offering procedure, and a layer-based adaptive estimation procedure.