DC, LG | Jun 24, 2023

Computron: Serving Distributed Deep Learning Models with Model Parallel Swapping

Daniel Zou, Xinchen Jin, Xueyang Yu, Hao Zhang, James Demmel

arXiv:2306.13835v11.21 citations | h-index: 73 | Has Code

Originality: Incremental advance

AI Analysis

The paper addresses GPU resource utilization when serving large models in fields like language and image understanding, but the advance is incremental because it builds on existing swapping techniques.

The paper tackles the problem of serving multiple large distributed deep learning models on a shared GPU cluster. It presents Computron, a system that uses model parallel swapping to speed up parameter transfers, and it demonstrates feasibility and improved resource utilization with tests on randomized workloads.

Many of the most performant deep learning models today in fields like language and image understanding are fine-tuned models that contain billions of parameters. In anticipation of workloads that involve serving many such large models to handle different tasks, we develop Computron, a system that uses memory swapping to serve multiple distributed models on a shared GPU cluster. Computron implements a model parallel swapping design that takes advantage of the aggregate CPU-GPU link bandwidth of a cluster to speed up model parameter transfers. This design makes swapping large models feasible and can improve resource utilization. We demonstrate that Computron successfully parallelizes model swapping on multiple GPUs, and we test it on randomized workloads to show how it can tolerate real-world variability factors like burstiness and skewed request rates. Computron's source code is available at https://github.com/dlzou/computron.
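
The aggregate-bandwidth argument in the abstract can be illustrated with a back-of-the-envelope estimate. The sketch below is not taken from the paper or from the Computron code. It assumes that a model is sharded evenly across GPUs, that each GPU has its own independent CPU-GPU link, and that all shards load concurrently with no other overhead. The function name `swap_time_seconds` and the numbers (a 60 GB model, 20 GB/s per link) are hypothetical.

```python
def swap_time_seconds(model_bytes: float, link_bytes_per_s: float, num_gpus: int) -> float:
    """Idealized time to load a model that is evenly sharded over num_gpus GPUs.

    Assumption: every GPU pulls its own shard over a dedicated CPU-GPU link,
    so the shards transfer in parallel and the slowest shard sets the time.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    shard_bytes = model_bytes / num_gpus
    return shard_bytes / link_bytes_per_s


if __name__ == "__main__":
    # Hypothetical figures chosen for illustration only.
    model_bytes = 60e9       # 60 GB of parameters
    link_bytes_per_s = 20e9  # 20 GB/s per CPU-GPU link
    for n in (1, 2, 4):
        t = swap_time_seconds(model_bytes, link_bytes_per_s, n)
        print(f"{n} GPU(s): {t:.2f} s")
```

Under these idealized assumptions the load time falls in proportion to the number of GPUs. A real deployment would see smaller gains if links are shared or if synchronization overhead is significant.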
