LG · DC · May 5, 2024

Towards a Flexible and High-Fidelity Approach to Distributed DNN Training Emulation

arXiv:2405.02969v1 · 2 citations · h-index: 5 · APSys
Originality: Incremental advance
AI Analysis

NeuronaBox gives researchers and practitioners in distributed machine learning a flexible, high-fidelity emulation tool, though the contribution is incremental because it builds on existing emulation concepts.

The paper tackles the problem of accurately emulating distributed deep neural network (DNN) training workloads. It proposes NeuronaBox, which executes training on a subset of real nodes and emulates the networked environment and the collective communication operations. In initial results from a proof-of-concept implementation, the emulated measurements differ from the real system by less than 1%.

We propose NeuronaBox, a flexible, user-friendly, and high-fidelity approach to emulate DNN training workloads. We argue that to accurately observe performance, it is possible to execute the training workload on a subset of real nodes and emulate the networked execution environment along with the collective communication operations. Initial results from a proof-of-concept implementation show that NeuronaBox replicates the behavior of actual systems with high accuracy, with an error margin of less than 1% between the emulated measurements and the real system.
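
The "error margin" quoted above can be read as a relative difference between an emulated measurement and the corresponding real one. The sketch below shows one plausible way to compute it; the function name `relative_error` and the timing values are hypothetical placeholders chosen for illustration, not figures or code from the paper, and the paper may define its metric differently.

```python
def relative_error(emulated: float, real: float) -> float:
    """Return |emulated - real| / real as a fraction of the real measurement."""
    if real <= 0:
        raise ValueError("real measurement must be positive")
    return abs(emulated - real) / real


# Hypothetical per-iteration training times in milliseconds.
# These are placeholder values, NOT results reported in the paper.
real_ms = 200.0
emulated_ms = 201.0

err = relative_error(emulated_ms, real_ms)
print(f"relative error: {err:.2%}")
print("within 1% margin:", err < 0.01)
```

Normalizing by the real measurement keeps the metric comparable across workloads with very different iteration times.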

