DC | LG | Jun 24, 2023

Computron: Serving Distributed Deep Learning Models with Model Parallel Swapping

arXiv:2306.13835v1 | 1 citation | h-index: 73 | Has Code
Originality: Incremental advance
AI Analysis

The work addresses the challenge of resource utilization when serving large models in fields like language and image understanding, but it is an incremental advance because it builds on existing swapping techniques.

The paper tackles the problem of serving multiple large distributed deep learning models on a shared GPU cluster. It develops Computron, a system that uses model parallel swapping to speed up parameter transfers, and uses tests on randomized workloads to show that swapping large models is feasible and can improve resource utilization.

Many of the most performant deep learning models today in fields like language and image understanding are fine-tuned models that contain billions of parameters. In anticipation of workloads that involve serving many such large models to handle different tasks, we develop Computron, a system that uses memory swapping to serve multiple distributed models on a shared GPU cluster. Computron implements a model parallel swapping design that takes advantage of the aggregate CPU-GPU link bandwidth of a cluster to speed up model parameter transfers. This design makes swapping large models feasible and can improve resource utilization. We demonstrate that Computron successfully parallelizes model swapping on multiple GPUs, and we test it on randomized workloads to show how it can tolerate real-world variability factors like burstiness and skewed request rates. Computron's source code is available at https://github.com/dlzou/computron.
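To make the bandwidth argument concrete, here is a minimal back-of-the-envelope sketch in Python. It is not taken from Computron's code. The function name `swap_time_seconds`, the even sharding of parameters, the independent per-GPU links, and the example figures (6 billion fp16 parameters, 16 GB/s per link) are all assumptions chosen for illustration, and the model ignores overheads such as synchronization and pinned-memory setup.

```python
def swap_time_seconds(model_bytes: float, link_bytes_per_s: float, num_gpus: int = 1) -> float:
    """Idealized time to copy a model's parameters from CPU to GPU memory.

    Assumption: parameters are sharded evenly across `num_gpus`, and every
    GPU transfers its shard over its own CPU-GPU link at the same time, so
    the aggregate bandwidth grows linearly with the number of GPUs.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    shard_bytes = model_bytes / num_gpus  # bytes each GPU must load
    return shard_bytes / link_bytes_per_s  # all shards load in parallel


if __name__ == "__main__":
    # Hypothetical figures: 6e9 parameters at 2 bytes each, 16 GB/s per link.
    model_bytes = 6e9 * 2
    link_bytes_per_s = 16e9
    for n in (1, 2, 4):
        t = swap_time_seconds(model_bytes, link_bytes_per_s, n)
        print(f"{n} GPU(s): {t:.4f} s")
```

Under these assumptions the estimated load time falls from 0.75 s on one GPU to 0.1875 s on four, which is the intuition behind swapping model shards in parallel rather than through a single link.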

