AR, Mar 7

Mozart: Modularized and Efficient MoE Training on 3.5D Wafer-Scale Chiplet Architectures

arXiv:2603.07006v1
AI Analysis

This work targets researchers and practitioners who train large-scale, modularized MoE-based LLMs, offering an incremental improvement in hardware utilization and parallelization.

This paper addresses the hardware deployment challenges of Mixture-of-Experts (MoE) architectures for Large Language Models (LLMs) by proposing Mozart, an algorithm-hardware co-design framework. Mozart introduces an expert allocation strategy and a fine-grained scheduling mechanism, along with an adaptive co-location of heterogeneous modules on specialized chiplets, resulting in significant efficiency gains across three popular MoE models.

The Mixture-of-Experts (MoE) architecture offers enhanced efficiency for Large Language Models (LLMs) through modularized computation, yet its inherent sparsity poses significant hardware deployment challenges, including memory locality issues, communication overhead, and inefficient utilization of computing resources. Inspired by the modular organization of the human brain, we propose Mozart, a novel algorithm-hardware co-design framework tailored for efficient training of MoE-based LLMs on 3.5D wafer-scale chiplet architectures. On the algorithm side, Mozart exploits the inherent modularity of chiplets and introduces: (1) an expert allocation strategy that enables efficient on-package all-to-all communication, and (2) a fine-grained scheduling mechanism that improves communication-computation overlap through streaming tokens and experts. On the architecture side, Mozart adaptively co-locates heterogeneous modules on specialized chiplets with a 2.5D NoP-Tree topology and hierarchical memory structure. Evaluation across three popular MoE models demonstrates significant efficiency gains, enabling more effective parallelization and resource utilization for large-scale modularized MoE-LLMs.
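To make the idea of communication-computation overlap through streaming concrete, here is a minimal sketch of a generic two-stage pipeline model. It is not Mozart's scheduler. The function name `pipelined_time`, the equal-sized chunks, and the assumption that transfer and compute run fully concurrently with no per-chunk overhead are all illustrative assumptions, not details taken from the paper.

```python
def pipelined_time(comm: float, compute: float, chunks: int) -> float:
    """Makespan of a two-stage pipeline (transfer, then compute).

    Hypothetical model: the workload is split into `chunks` equal pieces,
    and the transfer of one piece may overlap with the compute of another.
    Per-chunk launch and synchronization overheads are ignored.
    """
    if chunks < 1:
        raise ValueError("chunks must be at least 1")
    c = comm / chunks      # transfer time per chunk
    p = compute / chunks   # compute time per chunk
    # The first transfer and the last compute cannot be hidden;
    # in between, the slower stage sets the pace.
    return c + (chunks - 1) * max(c, p) + p


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(n, pipelined_time(comm=4.0, compute=4.0, chunks=n))
```

With a single chunk the model reduces to the fully serialized cost `comm + compute`. Finer chunks hide more of the faster stage behind the slower one, which is the general intuition behind streaming tokens and experts. In practice, per-chunk overheads would limit how fine the split can usefully be.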

