MixtureKit: A General Framework for Composing, Training, and Visualizing Mixture-of-Experts Models
MixtureKit provides a practical tool for researchers and developers working on MoE systems across various domains, but the contribution is incremental: it builds on existing MoE concepts and adds new implementation methods.
The paper tackles the challenge of building and analyzing Mixture-of-Experts (MoE) models by introducing MixtureKit, a modular framework that supports three methods for composing and training such models. Experiments show that a BTX-based model trained with it can outperform dense baselines on multiple benchmarks using multilingual code-switched data.
We introduce MixtureKit, a modular open-source framework for constructing, training, and analyzing Mixture-of-Experts (MoE) models from arbitrary pre-trained or fine-tuned models. MixtureKit currently supports three complementary methods: (i) \emph{Traditional MoE}, which uses a single router per transformer block to select experts; (ii) \emph{BTX} (Branch-Train-Mix), which introduces a separate router for each specified sub-layer, enabling fine-grained token routing; and (iii) \emph{BTS} (Branch-Train-Stitch), which keeps experts fully intact and introduces trainable stitch layers for controlled information exchange between the hub and the experts. MixtureKit automatically modifies the model configuration, patches the decoder and causal LM classes, and saves a unified checkpoint ready for inference or fine-tuning. We further provide a visualization interface to inspect per-token routing decisions, expert weight distributions, and layer-wise contributions. Experiments with multilingual code-switched data (e.g., Arabic-Latin) show that a BTX-based model trained using MixtureKit can outperform dense baseline models on multiple benchmarks. We release MixtureKit as a practical foundation for research and development of MoE-based systems across diverse domains.
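To make the routing idea concrete, the following is a minimal sketch of top-k token routing with one router per sub-layer, in the spirit of the BTX description above. It is not MixtureKit's API: the function names (`route_token`, `mix_expert_outputs`), the sub-layer names, the logits, and the choice of top-2 routing with renormalized softmax weights are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, top_k=2):
    """Pick the top_k experts for one token (hypothetical helper).

    Returns {expert_index: weight}, with weights renormalized to sum to 1.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}


def mix_expert_outputs(expert_outputs, weights):
    """Weighted sum of the selected experts' output vectors."""
    dim = len(expert_outputs[0])
    mixed = [0.0] * dim
    for i, w in weights.items():
        for d in range(dim):
            mixed[d] += w * expert_outputs[i][d]
    return mixed


# Assumed router logits for a single token: one router per sub-layer,
# three experts each. A single-router-per-block setup would instead
# reuse one set of logits for every sub-layer in the block.
routers = {
    "attention": [2.0, 0.5, -1.0],
    "mlp": [-0.3, 1.2, 1.1],
}
routing = {name: route_token(logits, top_k=2) for name, logits in routers.items()}

# Toy 2-dimensional expert outputs for the attention sub-layer.
expert_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = mix_expert_outputs(expert_outputs, routing["attention"])
```

In this sketch the same token is sent to experts 0 and 1 in the attention sub-layer but to experts 1 and 2 in the MLP sub-layer. The per-token weights held in `routing` are the kind of quantity a routing visualization would display.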