Hydraulis: Balancing Large Transformer Model Training via Co-designing Parallel Strategies and Data Assignment
This work addresses training inefficiencies for large-scale AI models, offering a domain-specific solution that is incremental in nature.
The paper tackles data-induced imbalances in large Transformer model training, which arise from uneven sequence lengths and packing discrepancies. It develops Hydraulis to jointly optimize parallel strategies and data assignment, yielding performance improvements of 1.32-2.66 times over existing systems.
To optimize large Transformer model training, both efficient parallel computing and advanced data management are indispensable. However, current methods often assume a stable and uniform training workload, neglecting data-induced imbalances, which arise from both the sampling and the packing processes and can impede training performance. Specifically, data sampling imbalance arises from the uneven sequence length distribution of the training data, while data packing imbalance stems from the discrepancy between the linear memory complexity and the quadratic time complexity of the attention mechanism. To address these imbalance issues, we develop Hydraulis, which jointly optimizes the parallel strategies and data assignment. First, we introduce large model training with dynamic heterogeneous parallel strategies in response to sequence length variations within and across training iterations. Second, we devise a two-stage data assignment approach that balances the training workloads both within and across model replicas. Empirical results demonstrate that Hydraulis outperforms existing systems by 1.32-2.66 times.
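The packing imbalance described above can be illustrated with a small sketch. It assumes that attention inside a pack is masked per sequence, so memory scales with the total token count while attention time scales with the sum of squared sequence lengths. The function names (`pack_memory`, `pack_attention_cost`, `greedy_balance`) and the cost proxies are hypothetical illustrations, not Hydraulis's actual cost model or assignment algorithm.

```python
def pack_memory(seq_lens):
    """Total tokens in a pack: a proxy for memory, linear in sequence length."""
    return sum(seq_lens)


def pack_attention_cost(seq_lens):
    """Sum of squared lengths: a proxy for attention time, quadratic per sequence."""
    return sum(n * n for n in seq_lens)


def greedy_balance(seq_lens, num_workers):
    """Longest-first greedy assignment that balances the quadratic cost proxy.

    This is a generic load-balancing heuristic used for illustration only.
    """
    bins = [[] for _ in range(num_workers)]
    loads = [0] * num_workers
    for n in sorted(seq_lens, reverse=True):
        i = loads.index(min(loads))  # worker with the smallest cost so far
        bins[i].append(n)
        loads[i] += n * n
    return bins


# Two packs with identical token counts but very different attention cost.
pack_a = [8192]      # one long sequence
pack_b = [1024] * 8  # eight short sequences

print(pack_memory(pack_a), pack_memory(pack_b))
print(pack_attention_cost(pack_a) // pack_attention_cost(pack_b))
```

Both packs hold 8192 tokens, yet under this proxy the single long sequence costs eight times as much attention compute as the eight short ones. Packing by token count alone therefore balances memory but not time, which is why a balancer might weigh sequences by squared length as `greedy_balance` does.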