LGAug 13, 2024

Robust Black-box Testing of Deep Neural Networks using Co-Domain Coverage

arXiv:2408.06766v11 citationsh-index: 36
Originality Highly original
AI Analysis

This addresses the need for robust testing of DNNs in safety-critical applications, offering a novel black-box approach that improves over existing white-box methods.

The paper tackles the problem of inadequate testing for deep neural networks by introducing a black-box coverage criterion called Co-Domain Coverage (CDC) and a fuzzing procedure named CoDoFuzz, which outperforms state-of-the-art methods by generating more misclassified inputs and inputs with low model confidence across six datasets.

Rigorous testing of machine learning models is necessary for trustworthy deployments. We present a novel black-box approach for generating test-suites for robust testing of deep neural networks (DNNs). Most existing methods create test inputs based on maximizing some "coverage" criterion/metric such as a fraction of neurons activated by the test inputs. Such approaches, however, can only analyze each neuron's behavior or each layer's output in isolation and are unable to capture their collective effect on the DNN's output, resulting in test suites that often do not capture the various failure modes of the DNN adequately. These approaches also require white-box access, i.e., access to the DNN's internals (node activations). We present a novel black-box coverage criterion called Co-Domain Coverage (CDC), which is defined as a function of the model's output and thus takes into account its end-to-end behavior. Subsequently, we develop a new fuzz testing procedure named CoDoFuzz, which uses CDC to guide the fuzzing process to generate a test suite for a DNN. We extensively compare the test suite generated by CoDoFuzz with those generated using several state-of-the-art coverage-based fuzz testing methods for the DNNs trained on six publicly available datasets. Experimental results establish the efficiency and efficacy of CoDoFuzz in generating the largest number of misclassified inputs and the inputs for which the model lacks confidence in its decision.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes