DeepStability: A Study of Unstable Numerical Methods and Their Solutions in Deep Learning
It addresses the problem of numerical instability in deep learning software for developers and tool builders, though it is incremental as it builds on existing libraries and methods.
The paper studied PyTorch and TensorFlow to identify numerically unstable deep learning algorithms, analyzing root causes and patches, and created the first database of such issues and solutions, with a fix accepted in TensorFlow.
Deep learning (DL) has become an integral part of solutions to various important problems, which is why ensuring the quality of DL systems is essential. One of the challenges of achieving reliability and robustness of DL software is to ensure that algorithm implementations are numerically stable. DL algorithms require a large amount and a wide variety of numerical computations. A naive implementation of numerical computation can lead to errors that may result in incorrect or inaccurate learning and results. A numerical algorithm or a mathematical formula can have several implementations that are mathematically equivalent, but have different numerical stability properties. Designing numerically stable algorithm implementations is challenging, because it requires an interdisciplinary knowledge of software engineering, DL, and numerical analysis. In this paper, we study two mature DL libraries PyTorch and Tensorflow with the goal of identifying unstable numerical methods and their solutions. Specifically, we investigate which DL algorithms are numerically unstable and conduct an in-depth analysis of the root cause, manifestation, and patches to numerical instabilities. Based on these findings, we launch, the first database of numerical stability issues and solutions in DL. Our findings and provide future references to developers and tool builders to prevent, detect, localize and fix numerically unstable algorithm implementations. To demonstrate that, using {\it DeepStability} we have located numerical stability issues in Tensorflow, and submitted a fix which has been accepted and merged in.