Training Distributed Deep Recurrent Neural Networks with Mixed Precision on GPU Clusters
Mixed-precision distributed training enables more efficient training of large-scale models in domains such as scientific data analysis, although the contribution is incremental: it applies existing mixed-precision techniques to distributed settings.
The paper tackles training deep recurrent neural networks with half-precision floats on GPU clusters, showing that half precision reduces memory and network bandwidth requirements while achieving test performance comparable to single precision, with strong scaling up to O(100) workers.
In this paper, we evaluate training of deep recurrent neural networks with half-precision floats. We implement a distributed, data-parallel, synchronous training algorithm by integrating TensorFlow and CUDA-aware MPI, which enables execution across multiple GPU nodes and makes use of high-speed interconnects. We introduce a learning rate schedule that facilitates neural network convergence at up to $O(100)$ workers. Strong scaling tests performed on clusters of NVIDIA Pascal P100 GPUs show linear runtime scaling and logarithmic communication time scaling for both single and mixed precision training modes. Performance is evaluated on two datasets: a scientific dataset taken from the Joint European Torus (JET) tokamak, containing multi-modal time series of sensory measurements leading up to deleterious events called plasma disruptions, and the benchmark Large Movie Review Dataset~\cite{imdb}. Half precision significantly reduces memory and network bandwidth requirements, allowing training of state-of-the-art models with over 70 million trainable parameters while achieving test set performance comparable to that of single precision.
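To make the data-parallel, synchronous scheme concrete, the following is a minimal sketch and not the implementation described above. It assumes a toy linear regression problem, simulates the workers in a single NumPy process, and replaces the MPI allreduce with a plain average. All names (`allreduce_mean_fp16`, `train`) and hyperparameters are hypothetical. Gradients are cast to float16 before the simulated exchange, which halves the payload, while the master weights stay in float32.

```python
import numpy as np


def allreduce_mean_fp16(worker_grads):
    """Simulated allreduce (assumption: stands in for an MPI allreduce).

    Each gradient is cast to float16 for "transmission", then accumulated
    and averaged in float32.
    """
    sent = [g.astype(np.float16) for g in worker_grads]  # half the bytes of float32
    total = np.zeros_like(worker_grads[0], dtype=np.float32)
    for g in sent:
        total += g.astype(np.float32)
    return total / len(sent)


def train(num_workers=4, steps=200, lr=0.1, seed=0):
    """Synchronous data-parallel gradient descent on a toy regression task."""
    rng = np.random.default_rng(seed)
    true_w = np.array([2.0, -3.0, 0.5], dtype=np.float32)
    X = rng.normal(size=(num_workers * 64, 3)).astype(np.float32)
    y = X @ true_w

    # Each simulated worker owns a fixed shard of the data.
    shards = np.array_split(np.arange(len(X)), num_workers)
    w = np.zeros(3, dtype=np.float32)  # float32 master copy of the weights

    for _ in range(steps):
        grads = []
        for idx in shards:
            Xb, yb = X[idx], y[idx]
            err = Xb @ w - yb
            # Gradient of the mean squared error on this worker's shard.
            grads.append((2.0 / len(idx)) * (Xb.T @ err))
        # Every worker applies the same averaged gradient (synchronous update).
        w -= lr * allreduce_mean_fp16(grads)
    return w, true_w


if __name__ == "__main__":
    w, true_w = train()
    print("learned:", w)
    print("target: ", true_w)
```

In a multi-node setting the loop over shards would run concurrently, one shard per GPU worker, and the averaging step would become a collective operation over the interconnect. The sketch only illustrates why a half-precision exchange reduces communication volume without changing the update rule.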