Demystifying the MLPerf Benchmark Suite
This work provides insights for optimizing distributed deep learning systems, but it is incremental: it builds on prior benchmarks and corroborates existing techniques.
The study analyzes the MLPerf benchmark suite and finds that it reveals system bottlenecks, notably the need for low-latency GPU interconnects and for smart scheduling in multi-GPU training. Its observations include variation in scaling efficiency across models and higher CPU utilization as more GPUs are used.
MLPerf, an emerging machine learning benchmark suite, strives to cover a broad range of machine learning applications. We present a study of its characteristics and of how the MLPerf benchmarks differ from previous deep learning benchmarks such as DAWNBench and DeepBench. We find that application benchmarks such as MLPerf, although rich in kernels, exhibit different features from kernel benchmarks such as DeepBench. The MLPerf benchmark suite contains a diverse set of models, which allows it to unveil various bottlenecks in the system. Based on our findings, a dedicated low-latency interconnect between GPUs in multi-GPU systems is required for optimal distributed deep learning training. We also observe variation in scaling efficiency across the MLPerf models. This variation highlights the importance of smart scheduling strategies for multi-GPU training. Another observation is that CPU utilization increases as the number of GPUs used for training increases. Corroborating prior work, we also observe and quantify the improvements possible from compiler optimizations, mixed-precision training, and the use of Tensor Cores.
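To make the notion of scaling efficiency concrete, the sketch below uses one common definition: the speedup over a single GPU divided by the number of GPUs, so that 1.0 means perfectly linear scaling. This definition is an assumption; the abstract does not state which formula the study uses. The function name `scaling_efficiency` and the timings in the usage lines are hypothetical and are not results from the study.

```python
def scaling_efficiency(time_1gpu: float, time_ngpu: float, n_gpus: int) -> float:
    """Return speedup over one GPU divided by the GPU count.

    Assumed definition (not taken from the paper):
        efficiency = (T_1 / T_N) / N
    where T_1 is the training time on one GPU and T_N on N GPUs.
    """
    if n_gpus < 1:
        raise ValueError("n_gpus must be at least 1")
    if time_1gpu <= 0 or time_ngpu <= 0:
        raise ValueError("training times must be positive")
    speedup = time_1gpu / time_ngpu
    return speedup / n_gpus


# Hypothetical timings in minutes, chosen only to illustrate the arithmetic.
linear = scaling_efficiency(100.0, 25.0, 4)      # 4x speedup on 4 GPUs
sublinear = scaling_efficiency(100.0, 50.0, 4)   # 2x speedup on 4 GPUs
print(f"linear: {linear:.2f}, sublinear: {sublinear:.2f}")
```

Under this definition, two models trained on the same hardware can have very different efficiencies, which is one way to read the case for scheduling: GPUs could be assigned preferentially to the models that use them well.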