Mixture of Raytraced Experts
This work addresses inefficient and rigid computation in MoE models. It is aimed at machine learning researchers and offers an incremental improvement with potential for faster and more expressive designs.
The paper tackles the fixed per-sample computation of Mixture of Experts (MoE) architectures by introducing a stacked MoE that dynamically selects sequences of experts. This yields variable computational graphs whose predictions become more accurate as computation cycles through more experts. Preliminary experiments show a 10% to 40% reduction in training epochs with comparable or higher accuracy.
We introduce a Mixture of Raytraced Experts, a stacked Mixture of Experts (MoE) architecture which can dynamically select sequences of experts, producing computational graphs of variable width and depth. Existing MoE architectures generally require a fixed amount of computation for a given sample. Our approach, in contrast, yields predictions with increasing accuracy as the computation cycles through the experts' sequence. We train our model by iteratively sampling from a set of candidate experts, unfolding the sequence akin to how Recurrent Neural Networks are trained. Our method does not require load-balancing mechanisms, and preliminary experiments show a reduction in training epochs of 10% to 40% with comparable or higher accuracy. These results point to new research directions in the field of MoEs, allowing the design of potentially faster and more expressive models. The code is available at https://github.com/nutig/RayTracing
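The idea of sampling a sequence of experts and reading out a prediction after each step can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function `unroll_experts`, the linear router, the residual `tanh` experts, and the shared readout are all hypothetical choices made only for illustration, and the sketch covers a single-path (depth-only) sequence rather than graphs of variable width.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def unroll_experts(x, expert_weights, router_weights, readout, max_steps, rng):
    """Sample one expert per step and emit a prediction after every step.

    Hypothetical shapes (assumptions, not taken from the paper):
      x:              (d,)       input features
      expert_weights: (E, d, d)  one linear map per candidate expert
      router_weights: (E, d)     scores each candidate from the current state
      readout:        (C, d)     shared classifier applied after every step
    """
    h = x.copy()
    path, predictions = [], []
    for _ in range(max_steps):
        # Routing distribution over the candidate experts for the current state.
        probs = softmax(router_weights @ h)
        # Sample the next expert in the sequence.
        k = int(rng.choice(len(expert_weights), p=probs))
        # Residual update by the selected expert (illustrative choice).
        h = h + np.tanh(expert_weights[k] @ h)
        path.append(k)
        # An intermediate prediction is available after each step.
        predictions.append(softmax(readout @ h))
    return path, predictions


rng = np.random.default_rng(0)
d, n_experts, n_classes = 8, 4, 3
experts = rng.normal(scale=0.1, size=(n_experts, d, d))
router = rng.normal(size=(n_experts, d))
readout = rng.normal(size=(n_classes, d))

path, preds = unroll_experts(
    rng.normal(size=d), experts, router, readout, max_steps=5, rng=rng
)
print("sampled expert sequence:", path)
print("prediction after last step:", preds[-1])
```

Because a prediction is produced after every step, the loop can in principle be stopped early, which is one way to read the abstract's claim that accuracy increases as computation cycles through the sequence. The weights here are random, so the outputs carry no meaning beyond showing the control flow.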