DC · AI · Mar 31, 2025

Orchestrate Multimodal Data with Batch Post-Balancing to Accelerate Multimodal Large Language Model Training

arXiv:2503.23830v2 · 2 citations · h-index: 7
AI Analysis

This work addresses inefficiencies in MLLM training and offers researchers and developers a scalable way to accelerate model development, though it is incremental in that it builds on existing training frameworks.

The paper tackles Modality Composition Incoherence in multimodal large language model (MLLM) training, where the proportion of each modality varies widely across examples, making mini-batch imbalances harder to address and reducing training efficiency. It introduces OrchMLLM, a framework that achieves a Model FLOPs Utilization (MFU) of 41.6% and outperforms Megatron-LM by up to 3.1x in throughput when training an 84B MLLM with three modalities on 2560 H100 GPUs.

Multimodal large language models (MLLMs), such as GPT-4o, are garnering significant attention. During the exploration of MLLM training, we identified Modality Composition Incoherence, a phenomenon in which the proportion of a given modality varies dramatically across examples. It exacerbates the challenges of addressing mini-batch imbalances, which lead to uneven GPU utilization between Data Parallel (DP) instances and severely degrade the efficiency and scalability of MLLM training, ultimately reducing training speed and hindering further research on MLLMs. To address these challenges, we introduce OrchMLLM, a comprehensive framework designed to mitigate the inefficiencies in MLLM training caused by Modality Composition Incoherence. First, we propose Batch Post-Balancing Dispatcher, a technique that efficiently eliminates mini-batch imbalances in sequential data. Additionally, we integrate MLLM Global Orchestrator into the training framework to orchestrate multimodal data and tackle the issues arising from Modality Composition Incoherence. We evaluate OrchMLLM across various MLLM sizes, demonstrating its efficiency and scalability. Experimental results reveal that OrchMLLM achieves a Model FLOPs Utilization (MFU) of $41.6\%$ when training an 84B MLLM with three modalities on $2560$ H100 GPUs, outperforming Megatron-LM by up to $3.1\times$ in throughput.
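
To make the idea of balancing mini-batches across DP instances concrete, here is a minimal sketch. It is not the paper's Batch Post-Balancing Dispatcher, whose algorithm the abstract does not spell out. It assumes that per-example cost is proportional to sequence length and swaps in a plain greedy longest-first heuristic; the names `post_balance`, `rank_loads`, and `imbalance` are hypothetical.

```python
import heapq

def post_balance(seq_lens, num_dp_ranks):
    """Assign already-sampled examples to DP ranks, longest first.

    Assumption: sequence length is a proxy for compute cost.
    Returns one list of example indices per rank.
    """
    # Min-heap of (current load, rank) so the lightest rank is popped first.
    heap = [(0, rank) for rank in range(num_dp_ranks)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_dp_ranks)]
    order = sorted(range(len(seq_lens)), key=lambda i: seq_lens[i], reverse=True)
    for idx in order:
        load, rank = heapq.heappop(heap)
        assignment[rank].append(idx)
        heapq.heappush(heap, (load + seq_lens[idx], rank))
    return assignment

def rank_loads(seq_lens, assignment):
    """Total sequence length handled by each rank."""
    return [sum(seq_lens[i] for i in idxs) for idxs in assignment]

def imbalance(loads):
    """Ratio of the heaviest rank's load to the mean load (1.0 is perfect)."""
    return max(loads) / (sum(loads) / len(loads))

# Illustrative sequence lengths (made up): a few long examples, many short ones.
seq_lens = [900, 850, 100, 120, 80, 60, 700, 90]

# Baseline: each rank keeps the two examples it originally sampled.
naive = [[0, 1], [2, 3], [4, 5], [6, 7]]
balanced = post_balance(seq_lens, num_dp_ranks=4)

print("naive loads:   ", rank_loads(seq_lens, naive))
print("balanced loads:", rank_loads(seq_lens, balanced))
print("naive imbalance:   ", round(imbalance(rank_loads(seq_lens, naive)), 2))
print("balanced imbalance:", round(imbalance(rank_loads(seq_lens, balanced)), 2))
```

Because the slowest DP rank determines the step time in synchronous data parallelism, lowering the maximum per-rank load is what matters here. A real system would also have to account for the cost of moving examples between ranks and for each modality's encoder having its own cost profile, neither of which this sketch models.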

