Towards a Flexible and High-Fidelity Approach to Distributed DNN Training Emulation
The work provides a flexible, high-fidelity emulation tool for researchers and practitioners in distributed machine learning, though the contribution is incremental because it builds on existing emulation concepts.
The paper tackles the problem of accurately emulating distributed deep neural network (DNN) training workloads. It proposes NeuronaBox, which executes training on a subset of real nodes and emulates the networked environment and the collective communication operations. Initial results from a proof-of-concept implementation show an error margin of less than 1% compared to real systems.
We propose NeuronaBox, a flexible, user-friendly, and high-fidelity approach to emulate DNN training workloads. We argue that to accurately observe performance, it is possible to execute the training workload on a subset of real nodes and emulate the networked execution environment along with the collective communication operations. Initial results from a proof-of-concept implementation show that NeuronaBox replicates the behavior of actual systems with high accuracy, with an error margin of less than 1% between the emulated measurements and the real system.
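To make the idea concrete, the sketch below shows one way an emulator could combine compute time measured on a real node with a modeled cost for a collective operation, and then compare the result against a real measurement. It is an illustration only, not NeuronaBox's implementation: the abstract does not describe the mechanism, so the alpha-beta link model, the ring all-reduce cost formula, the no-overlap assumption, and every name here (`LinkModel`, `ring_allreduce_time`, `emulated_iteration_time`, `relative_error`) are assumptions introduced for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkModel:
    """Hypothetical alpha-beta link model (an assumption, not from the paper)."""
    latency_s: float              # per-message latency (alpha), in seconds
    bandwidth_bytes_per_s: float  # link bandwidth, in bytes per second


def ring_allreduce_time(num_nodes: int, message_bytes: float, link: LinkModel) -> float:
    """Estimate ring all-reduce time under the standard alpha-beta cost model.

    A ring all-reduce takes 2 * (n - 1) steps, each moving message_bytes / n.
    """
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    if num_nodes == 1:
        return 0.0  # a single node has nothing to exchange
    steps = 2 * (num_nodes - 1)
    chunk_bytes = message_bytes / num_nodes
    return steps * (link.latency_s + chunk_bytes / link.bandwidth_bytes_per_s)


def emulated_iteration_time(compute_s: float, num_nodes: int,
                            gradient_bytes: float, link: LinkModel) -> float:
    """Compute time measured on the real node plus modeled communication time.

    Assumes compute and communication do not overlap, which is a simplification.
    """
    return compute_s + ring_allreduce_time(num_nodes, gradient_bytes, link)


def relative_error(emulated_s: float, real_s: float) -> float:
    """Relative error of an emulated measurement against the real system."""
    if real_s <= 0:
        raise ValueError("real_s must be positive")
    return abs(emulated_s - real_s) / real_s
```

With these helpers, a claim such as "error margin of less than 1%" corresponds to checking `relative_error(emulated_s, real_s) < 0.01` for each pair of emulated and real measurements.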