Automatic Generation of Behavioral Test Cases For Natural Language Processing Using Clustering and Prompting
This work addresses the time-consuming and expertise-dependent process of generating test cases for NLP model evaluation, and represents an incremental improvement over existing semi-automated methods.
The paper tackles the challenge of creating behavioral test cases for NLP models by introducing an automated approach using clustering and prompting with large language models, demonstrated on the Amazon Reviews corpus across four classification algorithms.
Recent work on behavioral testing for natural language processing (NLP) models, such as CheckList, is inspired by related paradigms in software engineering testing. Behavioral tests allow evaluation of general linguistic capabilities and domain understanding, and can therefore help evaluate conceptual soundness and identify model weaknesses. However, a major challenge is the creation of test cases. Current packages rely on a semi-automated approach involving manual development, which requires domain expertise and can be time-consuming. This paper introduces an automated approach to developing test cases that exploits the power of large language models and statistical techniques. It clusters the text representations to carefully construct meaningful groups and then applies prompting techniques to automatically generate Minimal Functionality Tests (MFTs). The well-known Amazon Reviews corpus is used to demonstrate the approach. We analyze the behavioral test profiles across four different classification algorithms and discuss the limitations and strengths of those models.
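The pipeline described above (cluster the text representations, prompt a language model per cluster, then score a classifier on the resulting MFTs) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the paper's implementation: the bag-of-words `embed` function stands in for a real text representation, `kmeans` is plain Lloyd's k-means with a deterministic start, the wording in `build_prompt` is hypothetical, the call to a language model is omitted, and the test cases are written by hand in place of generated ones.

```python
from collections import Counter


def embed(text, vocab):
    """Toy bag-of-words vector; an assumed stand-in for a real text embedding."""
    counts = Counter(text.lower().split())
    return [float(counts[w]) for w in vocab]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iters=10):
    """Plain Lloyd's k-means, initialised with the first k vectors."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: each vector joins its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(v, centroids[c]))
                  for v in vectors]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def build_prompt(cluster_texts, sentiment, n=3):
    """Hypothetical prompt asking a language model for MFT candidates."""
    examples = "\n".join(f"- {t}" for t in cluster_texts)
    return (f"Here are product reviews on a shared theme:\n{examples}\n"
            f"Write {n} short, unambiguous reviews on this theme "
            f"with {sentiment} sentiment.")


def mft_failure_rate(predict, cases):
    """Fraction of (text, expected_label) cases the model gets wrong."""
    failures = sum(1 for text, expected in cases if predict(text) != expected)
    return failures / len(cases)


# Made-up reviews, not drawn from the Amazon Reviews corpus.
reviews = [
    "battery life is great",
    "shipping was slow",
    "great battery and great life",
    "slow shipping and late delivery",
]
vocab = sorted({w for r in reviews for w in r.lower().split()})
labels = kmeans([embed(r, vocab) for r in reviews], k=2)

# Group the reviews by cluster and build one prompt per group.
clusters = {}
for text, lab in zip(reviews, labels):
    clusters.setdefault(lab, []).append(text)
prompt = build_prompt(clusters[0], "positive")


def keyword_classifier(text):
    """Deliberately naive model under test."""
    return "positive" if "great" in text.lower() else "negative"


# Hand-written stand-ins for test cases a language model might return.
cases = [
    ("The battery is great", "positive"),
    ("The battery life is not great", "negative"),
    ("Shipping was slow", "negative"),
    ("Delivery was fast", "positive"),
]
rate = mft_failure_rate(keyword_classifier, cases)
print(labels, rate)
```

In this toy run the battery reviews and the shipping reviews fall into separate clusters, and the keyword classifier fails the negated case and the "fast" case, which is the kind of weakness an MFT failure rate is meant to surface.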