Prabhat Ram

h-index9

3papers

331citations

Novelty33%

AI Score25

Ranked #165,000 of 194,257 authors (top 85%)#779 in DC (top 80%)

3 Papers

32.9DCJun 7, 2022Code

Tutel: Adaptive Mixture-of-Experts at Scale

Changho Hwang, Wei Cui, Yifan Xiong et al. · microsoft-research

Sparsely-gated mixture-of-experts (MoE) has been widely adopted to scale deep learning models to trillion-plus parameters with fixed computational cost. The algorithmic performance of MoE relies on its token routing mechanism that forwards each input token to the right sub-models or experts. While token routing dynamically determines the amount of expert workload at runtime, existing systems suffer inefficient computation due to their static execution, namely static parallelism and pipelining, which does not adapt to the dynamic workload. We present Flex, a highly scalable stack design and implementation for MoE with dynamically adaptive parallelism and pipelining. Flex designs an identical layout for distributing MoE model parameters and input data, which can be leveraged by all possible parallelism or pipelining methods without any mathematical inequivalence or tensor migration overhead. This enables adaptive parallelism/pipelining optimization at zero cost during runtime. Based on this key design, Flex also implements various MoE acceleration techniques. Aggregating all techniques, Flex finally delivers huge speedup at any scale -- 4.96x and 5.75x speedup of a single MoE layer over 16 and 2,048 A100 GPUs, respectively, over the previous state-of-the-art. Our evaluation shows that Flex efficiently and effectively runs a real-world MoE-based model named SwinV2-MoE, built upon Swin Transformer V2, a state-of-the-art computer vision architecture. On efficiency, Flex accelerates SwinV2-MoE, achieving up to 1.55x and 2.11x speedup in training and inference over Fairseq, respectively. On effectiveness, the SwinV2-MoE model achieves superior accuracy in both pre-training and down-stream computer vision tasks such as COCO object detection than the counterpart dense model, indicating the readiness of Flex for end-to-end real-world model training and inference.

13.6LGOct 21, 2021

MLPerf HPC: A Holistic Benchmark Suite for Scientific Machine Learning on HPC Systems

Steven Farrell, Murali Emani, Jacob Balma et al.

Scientific communities are increasingly adopting machine learning and deep learning models in their applications to accelerate scientific insights. High performance computing systems are pushing the frontiers of performance with a rich diversity of hardware resources and massive scale-out capabilities. There is a critical need to understand fair and effective benchmarking of machine learning applications that are representative of real-world scientific use cases. MLPerf is a community-driven standard to benchmark machine learning workloads, focusing on end-to-end performance metrics. In this paper, we introduce MLPerf HPC, a benchmark suite of large-scale scientific machine learning training applications driven by the MLCommons Association. We present the results from the first submission round, including a diverse set of some of the world's largest HPC systems. We develop a systematic framework for their joint analysis and compare them in terms of data staging, algorithmic convergence, and compute performance. As a result, we gain a quantitative understanding of optimizations on different subsystems such as staging and on-node loading of data, compute-unit utilization, and communication scheduling, enabling overall $>10 \times$ (end-to-end) performance improvements through system scaling. Notably, our analysis shows a scale-dependent interplay between the dataset size, a system's memory hierarchy, and training convergence that underlines the importance of near-compute storage. To overcome the data-parallel scalability challenge at large batch sizes, we discuss specific learning techniques and hybrid data-and-model parallelism that are effective on large systems. We conclude by characterizing each benchmark with respect to low-level memory, I/O, and network behavior to parameterize extended roofline performance models in future rounds.

2.7SESep 4, 2018

Software Process Measurement and Related Challenges in Agile Software Development: A Multiple Case Study

Prabhat Ram, Pilar Rodriguez, Markku Oivo

Existing scientific literature highlights the importance of metrics in Agile Software Development (ASD). Still, empirical investigation into metrics in ASD is scarce, particularly in identifying the rationale and the operational challenges associated with metrics. Under the Q-Rapids project (Horizon 2020), we conducted a multiple case study at four Agile companies, using the Goal Question Metric (GQM) approach, to investigate the rationale explaining the choice of process metrics in ASD, and challenges faced in operationalizing them. Results reflect that companies are interested in assessing process aspects like velocity, testing performance, and estimation accuracy, and they prefer custom metrics for these assessments. Companies use metrics as a means to access and even capitalize on the data, erstwhile inaccessible due to technical or process constraints. However, development context of a company can hinder metrics operationalization, manifesting primarily as unavailability of the data required to measure metrics...(check the paper for full Abstract)