Online Mixture of Experts: No-Regret Learning for Optimal Collective Decision-Making
This work addresses how to combine expert predictions efficiently in online learning, with improving LLM performance as a motivating application, though the contribution appears incremental because it builds on existing bandit and mixture-of-experts frameworks.
The paper tackles the problem of aggregating outputs from multiple experts in a bandit learning setting to achieve optimal collective decision-making, proposing two algorithms that combine voting mechanisms with theoretical no-regret guarantees and applying them to fine-tune large language models for improved accuracy.
We explore the use of expert-guided bandit learning, which we refer to as online mixture-of-experts (OMoE). In this setting, given a context, a candidate committee of experts must determine how to aggregate their outputs to maximize aggregate accuracy. We propose two algorithms to address this problem. The first algorithm combines aggregate voting with UCB-driven successive elimination, efficiently pruning suboptimal exploration actions. The second algorithm employs an online weighted-majority-voting mechanism, in which each expert's voting power is proportional to its predictive power. We derive theoretical guarantees on the regret of both algorithms in the bandit setting under ideal circumstances, and provide supporting empirical results. As a contemporary application, we apply these methods to the online fine-tuning of a set of expert large language models (LLMs), where after each response, the generative LLM dynamically reweights its set of experts and/or selects the optimal committee of experts to generate the most accurate response. Our results introduce new methodologies and no-regret guarantees for combining multiple experts to improve the overall performance of an aggregate model.
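To make the weighted-majority-voting idea concrete, the sketch below implements the classic deterministic weighted majority scheme with a multiplicative penalty. It is an illustration only and not the paper's algorithm: it assumes the true label is revealed after every round (full feedback rather than bandit feedback), and the names `weighted_majority_vote`, `run_weighted_majority`, and the penalty factor `beta` are hypothetical choices made for this example.

```python
from collections import defaultdict


def weighted_majority_vote(votes, weights):
    """Return the label with the largest total weight among the experts' votes."""
    totals = defaultdict(float)
    for vote, weight in zip(votes, weights):
        totals[vote] += weight
    # On a tie, max() returns the tied label that was voted for first.
    return max(totals, key=totals.get)


def run_weighted_majority(rounds, beta=0.5):
    """Classic weighted majority with multiplicative penalty `beta` (assumed value).

    rounds: iterable of (votes, truth) pairs, with one vote per expert.
    Returns (final_weights, number_of_mistakes_made_by_the_aggregate).
    """
    weights = None
    mistakes = 0
    for votes, truth in rounds:
        if weights is None:
            # Every expert starts with the same voting power.
            weights = [1.0] * len(votes)
        prediction = weighted_majority_vote(votes, weights)
        if prediction != truth:
            mistakes += 1
        # Shrink the weight of every expert that voted incorrectly this round.
        weights = [w * beta if v != truth else w for v, w in zip(votes, weights)]
    return weights, mistakes


if __name__ == "__main__":
    # Three experts; the first is always right, the other two are always wrong.
    rounds = [([1, 0, 0], 1), ([1, 0, 0], 1)]
    final_weights, mistakes = run_weighted_majority(rounds)
    print(final_weights, mistakes)
```

In this toy run the aggregate is outvoted in the first round, but the two inaccurate experts lose half their weight each time they err, so the accurate expert's vote prevails from the second round on. A bandit variant of the kind the abstract describes would have to estimate each expert's accuracy from partial feedback instead of observing the truth directly.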