Statistical Quality and Reproducibility of Pseudorandom Number Generators in Machine Learning technologies
This work addresses a critical issue for ML practitioners and researchers who rely on PRNGs for reproducibility and robustness in tasks such as data shuffling and weight initialization. The contribution is incremental, since it builds on existing test suites.
The paper tackles the problem of statistical quality and reproducibility of pseudorandom number generators (PRNGs) in machine learning frameworks, finding that even 'crush-resistant' generators like PCG and Philox can fail certain statistical tests, with differences observed between native and framework-integrated versions.
Machine learning (ML) frameworks rely heavily on pseudorandom number generators (PRNGs) for tasks such as data shuffling, weight initialization, dropout, and optimization. Yet the statistical quality and reproducibility of these generators, particularly when integrated into frameworks like PyTorch, TensorFlow, and NumPy, are underexplored. In this paper, we compare the statistical quality of PRNGs used in ML frameworks (Mersenne Twister, PCG, and Philox) against their original C implementations. Using the rigorous TestU01 BigCrush test suite, we evaluate 896 independent random streams for each generator. Our findings challenge claims of statistical robustness, revealing that even generators labeled "crush-resistant" (e.g., PCG, Philox) may fail certain statistical tests. Surprisingly, we observe differences in failure profiles between the native and framework-integrated versions of the same algorithm, which points to possible implementation differences.
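To make the notion of independent, reproducible streams concrete, the following is a minimal sketch of how such streams might be derived for the three generator families through NumPy. It is not the authors' test harness: the helper names `make_streams` and `sample_words`, the root seed, and the use of `SeedSequence.spawn` to derive child seeds are assumptions made for illustration, and the abstract does not say how the 896 streams were seeded. TestU01 itself is a C library, so the 32-bit words produced here would still have to be passed to it separately.

```python
import numpy as np

# NumPy bit generators matching the three algorithms named in the abstract.
BIT_GENERATORS = {
    "MT19937": np.random.MT19937,  # Mersenne Twister
    "PCG64": np.random.PCG64,      # PCG family
    "Philox": np.random.Philox,    # counter-based Philox
}


def make_streams(name, root_seed, n_streams):
    """Return n_streams Generators derived from one root seed.

    Assumption: child seeds come from SeedSequence.spawn, NumPy's
    documented mechanism for creating independent streams.
    """
    children = np.random.SeedSequence(root_seed).spawn(n_streams)
    return [np.random.Generator(BIT_GENERATORS[name](child)) for child in children]


def sample_words(gen, n_words):
    """Draw n_words uniform 32-bit unsigned integers from one stream."""
    return gen.integers(0, 2**32, size=n_words, dtype=np.uint32)


if __name__ == "__main__":
    # Hypothetical root seed; 896 is the stream count reported in the abstract.
    streams = make_streams("PCG64", root_seed=12345, n_streams=896)
    words = sample_words(streams[0], n_words=8)
    print(len(streams), words.dtype, words.shape)
```

Rebuilding the streams from the same root seed reproduces the same words, which is the property a reproducibility study depends on. Note that identical output across NumPy versions or across frameworks is a separate question and is not guaranteed by this sketch.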