A Dynamic Weighting Strategy to Mitigate Worker Node Failure in Distributed Deep Learning
This paper addresses efficiency issues in large-scale distributed deep learning training, but its contribution appears incremental because it builds on existing methods such as EASGD and AdaHessian.
The paper tackles the problem of straggler nodes caused by failure in distributed deep learning systems by proposing a dynamic weighting strategy. Experiments with varying numbers of workers and communication periods show improved convergence rates and test performance.
The increasing complexity of deep learning models and the demand for processing vast amounts of data make large-scale distributed systems essential for efficient training. These systems, however, face significant challenges such as communication overhead, hardware limitations, and node failure. This paper investigates optimization techniques in distributed deep learning, including Elastic Averaging SGD (EASGD) and the second-order method AdaHessian. We propose a dynamic weighting strategy to mitigate the problem of straggler nodes due to failure, improving the performance and efficiency of the overall training process. We conduct experiments with different numbers of workers and communication periods to demonstrate improved convergence rates and test performance using our strategy.
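To make the idea concrete, the following is a minimal sketch of one elastic-averaging synchronization step in which each worker's pull on the center variable is scaled by a per-worker weight. The abstract does not state the weighting rule, so the staleness-based rule here is purely an illustrative assumption, and the names `staleness_weights`, `elastic_sync`, `decay`, and `alpha` are hypothetical rather than taken from the paper.

```python
from typing import List


def staleness_weights(staleness: List[int], decay: float = 0.5) -> List[float]:
    """Hypothetical weighting rule (an assumption, not the paper's formula).

    A worker that has missed `s` communication rounds gets raw weight
    decay ** s; the weights are then normalized to sum to 1.
    """
    raw = [decay ** s for s in staleness]
    total = sum(raw)
    return [r / total for r in raw]


def elastic_sync(
    center: List[float],
    workers: List[List[float]],
    weights: List[float],
    alpha: float,
) -> List[float]:
    """One weighted elastic-averaging step.

    Each worker is pulled toward the old center with strength `alpha`
    (as in EASGD), while the center is pulled toward each worker with
    strength `alpha * weight`, so down-weighted workers move it less.
    Worker parameters are updated in place; the new center is returned.
    """
    new_center = list(center)
    for params, weight in zip(workers, weights):
        for j, c in enumerate(center):
            diff = params[j] - c                     # elastic difference to the old center
            new_center[j] += alpha * weight * diff   # weighted pull on the center
            params[j] -= alpha * diff                # worker moves toward the center
    return new_center


# Usage: worker 1 has missed one communication round, so it gets a smaller weight.
center = [0.0]
workers = [[1.0], [3.0]]
weights = staleness_weights([0, 1])
center = elastic_sync(center, workers, weights, alpha=0.5)
```

Under this assumed rule, a failed or lagging worker still contributes to the center variable but with reduced influence, which is one plausible way to read "dynamic weighting" in this setting.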