MECRDec 23, 2019

Study on upper limit of sample sizes for a two-level test in NIST SP800-22

arXiv:1912.10602v34 citations
Originality Synthesis-oriented
AI Analysis

This work provides incremental improvements for practitioners using NIST SP800-22 to test pseudorandom number generators, helping avoid false rejections and potentially increasing test sensitivity.

The paper addresses the issue of the two-level test in NIST SP800-22 rejecting good pseudorandom number generators due to large sample sizes by proposing practical upper limits for six tests, derived from chi-squared discrepancies, and shows that using these limits yields appropriate results while risky sizes often cause false rejections.

NIST SP800-22 is one of the most widely used statistical testing tools for pseudorandom number generators (PRNGs). This tool consists of 15 tests (one-level tests) and two additional tests (two-level tests). Each one-level test provides one or more $p$-values. The two-level tests measure the uniformity of the obtained $p$-values for a fixed one-level test. One of the two-level tests categorizes the $p$-values into ten intervals of equal length, and apply a chi-squared goodness-of-fit test. This two-level test is often more powerful than one-level tests, but sometimes it rejects even good PRNGs when the sample size at the second level is too large, since it detects approximation errors in the computation of $p$-values. In this paper, we propose a practical upper limit of the sample size in this two-level test, for each of six tests appeared in SP800-22. These upper limits are derived by the chi-squared discrepancy between the distribution of the approximated $p$-values and the uniform distribution $U(0, 1)$. We also computed a "risky" sample size at the second level for each one-level test. Our experiments show that the two-level test with the proposed upper limit gives appropriate results, while using the risky size often rejects even good PRNGs. We also propose another improvement: to use the exact probability for the ten categories in the computation of goodness-of-fit at the two-level test. This allows us to increase the sample size at the second level, and would make the test more sensitive than the NIST's recommending usage.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes