Joint Online Learning and Decision-making via Dual Mirror Descent
This work addresses incremental improvements in online decision-making with constraints for applications like revenue management, by integrating learning and optimization in a dual framework.
The paper tackles the problem of online revenue maximization with cost constraints over a finite horizon, where an agent makes adaptive decisions based on context vectors from an unknown distribution. It proposes a new algorithm combining dual mirror descent with parameter learning, achieving O(√T) regret and constraint violations when parameters are known, and similar bounds plus learning-dependent terms when parameters are unknown.
We consider an online revenue maximization problem over a finite time horizon subject to lower and upper bounds on cost. At each period, an agent receives a context vector sampled i.i.d. from an unknown distribution and needs to make a decision adaptively. The revenue and cost functions depend on the context vector as well as some fixed but possibly unknown parameter vector to be learned. We propose a novel offline benchmark and a new algorithm that mixes an online dual mirror descent scheme with a generic parameter learning process. When the parameter vector is known, we demonstrate an $O(\sqrt{T})$ regret result as well an $O(\sqrt{T})$ bound on the possible constraint violations. When the parameter is not known and must be learned, we demonstrate that the regret and constraint violations are the sums of the previous $O(\sqrt{T})$ terms plus terms that directly depend on the convergence of the learning process.