DC · May 27

Addressing Variable Heterogeneity in Distributed Multimodal Training with Entrain

arXiv:2605.27918 · 66.2 · h-index: 8

Predicted impact: top 23% in DC · last 90 days

Originality: Incremental advance

AI Analysis

For practitioners training multimodal LLMs, Entrain provides a static model-parallel configuration that eliminates the need for dynamic parallelism, simplifying distributed training.

Entrain addresses workload imbalance in distributed multimodal LLM training caused by data heterogeneity, reducing workload variability across microbatches by up to 10.6× and improving end-to-end training throughput by up to 1.40× over existing baselines.

Multimodal LLM datasets are inherently heterogeneous, with significant data variability. Although each modality exhibits independent variability, sample-level entanglement makes it difficult to balance workloads across both modalities and batches. We present Entrain, a distributed MLLM training framework that addresses both heterogeneity and variability in multimodal training workloads. Entrain challenges the intuition that dynamic data variability requires dynamic model parallelism by shifting the profiling paradigm from micro-level samples to macroscopic batches. We prove that a single, static model-parallel configuration suffices for optimal load balancing under this paradigm. At the microscopic scale, Entrain introduces a hierarchical microbatch assignment algorithm that defers excess workload within each iteration to stabilize variability across microbatches. Evaluations show that Entrain reduces workload variability across microbatches by up to 10.6×, improving end-to-end training throughput by up to 1.40× over existing baselines.
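To make the microbatch-balancing problem concrete, the sketch below compares a naive round-robin split with a generic greedy longest-processing-time (LPT) assignment. This is not Entrain's hierarchical algorithm, which the abstract does not specify in detail; it is a standard heuristic swapped in for illustration. The function names and the per-sample `costs` values are hypothetical, standing in for whatever combined per-sample compute estimate a profiler would produce.

```python
import heapq
from typing import List, Sequence


def round_robin_microbatches(n_samples: int, k: int) -> List[List[int]]:
    """Assign sample indices to k microbatches in arrival order, ignoring cost."""
    return [list(range(b, n_samples, k)) for b in range(k)]


def greedy_microbatches(costs: Sequence[float], k: int) -> List[List[int]]:
    """Generic LPT heuristic: heaviest remaining sample goes to the lightest microbatch."""
    heap = [(0.0, b) for b in range(k)]  # (current load, microbatch id)
    heapq.heapify(heap)
    batches: List[List[int]] = [[] for _ in range(k)]
    for idx in sorted(range(len(costs)), key=lambda i: costs[i], reverse=True):
        load, b = heapq.heappop(heap)
        batches[b].append(idx)
        heapq.heappush(heap, (load + costs[idx], b))
    return batches


def imbalance(costs: Sequence[float], batches: List[List[int]]) -> float:
    """Ratio of the heaviest microbatch load to the mean load (1.0 is perfectly balanced)."""
    loads = [sum(costs[i] for i in batch) for batch in batches]
    return max(loads) / (sum(loads) / len(loads))


# Hypothetical per-sample costs (assumed units: relative compute).
costs = [9, 1, 8, 2, 7, 3, 6, 4]
naive = round_robin_microbatches(len(costs), 2)
balanced = greedy_microbatches(costs, 2)
print(imbalance(costs, naive), imbalance(costs, balanced))
```

With these assumed costs, the round-robin split puts all heavy samples in one microbatch, while the greedy assignment equalizes the two loads. Since the slowest microbatch sets the pace in pipelined training, this kind of gap is what reducing workload variability aims to close.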
