SE | Dec 12, 2021
Rise of Distributed Deep Learning Training in the Big Model Era: From a Software Engineering Perspective
Xuanzhe Liu, Diandian Gu, Zhenpeng Chen et al.
Deep learning (DL) has become a key component of modern software. In the "big model" era, the rich features of DL-based software rely substantially on powerful DL models, e.g., BERT, GPT-3, and the recently emerging GPT-4, which are trained in the cloud on large datasets. Hence, training effective DL models has become a vital stage in the whole software lifecycle. When training deep learning models, especially big models, developers need to parallelize and distribute the computation and memory resources across multiple devices in the training process, which is known as distributed deep learning training, or distributed training for short. However, the unique challenges that developers encounter in the distributed training process have not been studied in the software engineering community. Given the increasingly heavy dependence of current DL-based software on distributed training, this paper aims to fill this knowledge gap and presents the first comprehensive study on developers' issues in distributed training. To this end, we analyze 1,131 real-world developer issues about using distributed training frameworks, reported on Stack Overflow and GitHub. We construct a fine-grained taxonomy consisting of 30 categories of fault symptoms and summarize common fix patterns for different symptoms. Based on the results, we suggest actionable implications for research avenues that can potentially facilitate distributed training in the development of DL-based software, such as focusing on the frequent and common fix patterns when designing testing or debugging tools, developing efficient testing and debugging techniques for communication configuration along with the synthesis of network configuration analysis, designing new multi-device checkpoint-and-replay techniques to help reproduction, and designing serverless APIs for cloud platforms.
SE | Jan 10, 2021
An Empirical Study on Serverless Workflow Service
Jinfeng Wen, Yi Liu
With the wide adoption of serverless computing, more and more applications are developed and deployed on cloud platforms. Major cloud providers offer serverless workflow services to orchestrate serverless functions, making it possible to run complex applications effectively. Comprehensive guidance is necessary to help developers understand the pros and cons of these serverless workflow services and make better choices among them. However, the characteristics of these services have not been systematically analyzed. To fill this knowledge gap, we survey four mainstream serverless workflow services, investigating their characteristics and performance. Specifically, we review their official documents and compare them along seven dimensions, including programming model and state management. Then, we compare their performance (i.e., execution time of functions, execution time of workflows, and orchestration overhead of workflows) under various experimental settings, considering the activity complexity and data-flow complexity of workflows as well as the function complexity of serverless functions. Finally, we discuss and verify the effectiveness of the services for two actual workloads. Our findings could help application developers and serverless providers improve development efficiency and user experience.
SE | Dec 2, 2020
Characterizing Commodity Serverless Computing Platforms
Jinfeng Wen, Yi Liu, Zhenpeng Chen et al.
Serverless computing has become a trending paradigm in cloud computing, allowing developers to focus on the development of core application logic and rapidly construct prototypes by composing independent functions. With the development and prosperity of serverless computing, major cloud vendors have successively rolled out their commodity serverless computing platforms. However, the characteristics of these platforms have not been systematically studied. Measuring these characteristics can help developers select the most suitable serverless computing platform and develop their serverless-based applications in the right way. To fill this knowledge gap, we present a comprehensive study on characterizing mainstream commodity serverless computing platforms, including AWS Lambda, Google Cloud Functions, Azure Functions, and Alibaba Cloud Function Compute. Specifically, we conduct both qualitative and quantitative analysis. In the qualitative analysis, we compare these platforms from three aspects (i.e., development, deployment, and runtime) based on their official documentation to construct a taxonomy of characteristics. In the quantitative analysis, we analyze the runtime performance of these platforms from multiple dimensions with well-designed benchmarks. First, we analyze three key factors that can influence the startup latency of serverless-based applications. Second, we compare the resource efficiency of different platforms with 16 representative benchmarks. Finally, we measure their performance differences when dealing with different concurrent requests, and explore the potential causes in a black-box fashion. Based on the results of both the qualitative and quantitative analysis, we derive a series of findings and provide insightful implications for both developers and cloud vendors.