LG · AI · DC · Nov 6, 2023

Saturn: Efficient Multi-Large-Model Deep Learning

arXiv:2311.02840v1 · h-index: 9
Originality: Incremental advance
AI Analysis

The paper addresses a domain-specific problem, training efficiency for users building multiple large models. The advance is incremental because it builds on existing systems challenges.

The paper tackles the inefficiency of multi-large-model training in tasks such as model selection. It proposes Saturn, a data system that jointly optimizes parallelism technique selection, the distribution of GPUs over jobs, and scheduling. The reported result is 39-49% lower model selection runtimes than typical current deep learning practice.

In this paper, we propose Saturn, a new data system to improve the efficiency of multi-large-model training (e.g., during model selection/hyperparameter optimization). We first identify three key interconnected systems challenges for users building large models in this setting -- parallelism technique selection, distribution of GPUs over jobs, and scheduling. We then formalize these as a joint problem, and build a new system architecture to tackle these challenges simultaneously. Our evaluations show that our joint-optimization approach yields 39-49% lower model selection runtimes than typical current DL practice.
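To make the joint problem in the abstract concrete, here is a minimal sketch of what choosing a parallelism technique, a GPU count, and a start order together can look like. It is not Saturn's solver or architecture: it is a toy brute-force search, and the job names, technique names, runtime estimates, and the functions makespan and best_plan are all hypothetical, made up for illustration.

```python
import heapq
from itertools import permutations, product

# Hypothetical runtime estimates in hours: job -> {(technique, gpus): runtime}.
# Every name and number here is invented for illustration.
ESTIMATES = {
    "model_a": {("data_parallel", 1): 8.0, ("data_parallel", 2): 4.5, ("pipeline", 4): 2.5},
    "model_b": {("data_parallel", 1): 6.0, ("sharded", 2): 3.5, ("sharded", 4): 2.0},
    "model_c": {("data_parallel", 1): 4.0, ("pipeline", 2): 2.5},
}
TOTAL_GPUS = 4


def makespan(order, choice, estimates, total_gpus):
    """Simulate a non-preemptive schedule that starts jobs in the given order,
    each one as soon as enough GPUs are free, and return the finish time."""
    free, now, finish = total_gpus, 0.0, 0.0
    running = []  # min-heap of (end_time, gpus_held)
    for job in order:
        technique, gpus = choice[job]
        runtime = estimates[job][(technique, gpus)]
        # Wait for earlier jobs to finish until this job's GPU request fits.
        while free < gpus:
            end, released = heapq.heappop(running)
            now = max(now, end)
            free += released
        heapq.heappush(running, (now + runtime, gpus))
        free -= gpus
        finish = max(finish, now + runtime)
    return finish


def best_plan(estimates, total_gpus):
    """Exhaustively try every (technique, gpus) choice per job and every
    start order; feasible only for a handful of jobs."""
    jobs = list(estimates)
    options = [[cfg for cfg in estimates[j] if cfg[1] <= total_gpus] for j in jobs]
    best = None
    for combo in product(*options):
        choice = dict(zip(jobs, combo))
        for order in permutations(jobs):
            span = makespan(order, choice, estimates, total_gpus)
            if best is None or span < best[0]:
                best = (span, choice, order)
    return best


span, choice, order = best_plan(ESTIMATES, TOTAL_GPUS)
print(f"makespan: {span} h")
for job in order:
    print(job, choice[job])
```

The sketch shows why the three decisions interact: giving one job more GPUs shortens it but delays the others, so the best technique for a job depends on how the rest are allocated and ordered. Exhaustive search grows factorially with the number of jobs, so it only illustrates the problem shape and would not scale to a real workload.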

