Revisiting Neuron Coverage for DNN Testing: A Layer-Wise and Distribution-Aware Criterion
This work addresses the need for more effective DNN testing criteria to improve software reliability. Its contribution is incremental in that it builds on prior coverage methods.
Existing neuron coverage criteria for deep neural network (DNN) testing show little correlation with test suite quality. The paper addresses this by proposing a new criterion, NeuraL Coverage (NLC), which assesses test suites based on their induced layer output distributions. The results show that NLC is significantly correlated with test suite diversity across tasks and data formats, and that input mutation guided by NLC exposes erroneous behaviors of greater quality and diversity.
Various deep neural network (DNN) coverage criteria have been proposed to assess DNN test inputs and steer input mutations. These criteria characterize coverage via neurons having certain outputs, or via the discrepancy between neuron outputs. Nevertheless, recent research indicates that neuron coverage criteria show little correlation with test suite quality. In general, DNNs make predictions by approximating distributions through hierarchical layers. We therefore advocate deducing DNN behaviors from these approximated distributions, viewed at the layer level: a test suite should be assessed using the layer output distributions it induces, and, to fully examine DNN behaviors, input mutation should be directed toward diversifying the approximated distributions. This paper summarizes eight design requirements for DNN coverage criteria, taking into account distribution properties and practical concerns. We then propose a new criterion, NeuraL Coverage (NLC), that satisfies all eight requirements. NLC treats a single DNN layer, rather than a single neuron, as the basic computational unit and captures four critical properties of neuron output distributions. Thus, NLC accurately describes how DNNs comprehend inputs via approximated distributions. We demonstrate that NLC is significantly correlated with the diversity of a test suite across a number of tasks (classification and generation) and data formats (image and text). Its capacity to discover DNN prediction errors is promising. Test input mutation guided by NLC results in greater quality and diversity of exposed erroneous behaviors.
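To make the contrast between neuron-level and layer-level criteria concrete, the following is a minimal sketch and not the paper's definition of NLC. It assumes a threshold-based neuron coverage as the baseline and uses a hypothetical layer-level score, the mean absolute entry of the covariance matrix of one layer's outputs over the test suite, as a stand-in for a distribution-aware criterion. The function names, the threshold of 0.5, and the choice of covariance as the summary statistic are all assumptions made for illustration.

```python
from typing import List

# Rows are test inputs, columns are the neurons of one layer.
Matrix = List[List[float]]


def neuron_coverage(layer_outputs: Matrix, threshold: float = 0.5) -> float:
    """Fraction of neurons whose output exceeds `threshold` on at least one input."""
    n_neurons = len(layer_outputs[0])
    covered = [
        any(row[j] > threshold for row in layer_outputs) for j in range(n_neurons)
    ]
    return sum(covered) / n_neurons


def covariance(layer_outputs: Matrix) -> Matrix:
    """Population covariance matrix of the layer's outputs over the test suite."""
    n = len(layer_outputs)
    d = len(layer_outputs[0])
    means = [sum(row[j] for row in layer_outputs) / n for j in range(d)]
    return [
        [
            sum((row[i] - means[i]) * (row[j] - means[j]) for row in layer_outputs) / n
            for j in range(d)
        ]
        for i in range(d)
    ]


def layer_spread_score(layer_outputs: Matrix) -> float:
    """Hypothetical layer-level score (an assumption, not NLC itself):
    mean absolute entry of the covariance matrix, which grows with both the
    variance of individual neurons and the correlation between neurons."""
    cov = covariance(layer_outputs)
    d = len(cov)
    return sum(abs(value) for row in cov for value in row) / (d * d)


# Two toy test suites observed at the same two-neuron layer.
suite_a = [[0.9, 0.9], [0.9, 0.9]]  # two identical inputs
suite_b = [[0.1, 0.9], [0.9, 0.1]]  # two inputs that drive the layer differently

for name, suite in (("suite_a", suite_a), ("suite_b", suite_b)):
    print(name, neuron_coverage(suite), layer_spread_score(suite))
```

In this toy setting both suites reach full threshold-based neuron coverage, yet only the second induces any spread in the layer's output distribution. That is the kind of difference a layer-wise, distribution-aware criterion is meant to expose, although the actual criterion in the paper captures four distribution properties rather than this single statistic.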