SE, CL, CR · Mar 7, 2025

AutoTestForge: A Multidimensional Automated Testing Framework for Natural Language Processing Models

Hengrui Xing, Cong Tian, Liang Zhao, Zhi Ma, WenSheng Wang, Nan Zhang, Chao Huang, Zhenhua Duan

arXiv:2503.05102v1 · h-index: 24 · ACM Trans Softw Eng Methodol

Originality: Incremental advance

AI Analysis

This work addresses the problem of comprehensive, automated evaluation of NLP models. It is an incremental advance: it builds on existing behavioral testing methods by adding automation and multidimensional coverage.

The paper tackles two limitations of behavioral testing for NLP models, its reliance on manual labor and its narrow scope of capability assessment, by introducing AutoTestForge, an automated multidimensional testing framework. The framework uses LLMs to generate test templates and a multi-model voting system to validate test case labels, and it reports higher error detection rates than existing datasets and testing tools (on average 30.89% for sentiment analysis and 34.58% for semantic textual similarity).

In recent years, the use of behavioral testing in Natural Language Processing (NLP) model evaluation has grown substantially. However, existing methods remain restricted by their need for manual labor and by the limited scope of their capability assessment. To address these limitations, we introduce AutoTestForge, an automated and multidimensional testing framework for NLP models. AutoTestForge uses Large Language Models (LLMs) to automatically generate test templates and instantiate them, which significantly reduces manual involvement. In addition, it validates test case labels with a mechanism based on differential testing, which uses a multi-model voting system to ensure the quality of test cases. The framework also extends the test suite across three dimensions (taxonomy, fairness, and robustness), enabling a more thorough assessment of the strengths and weaknesses of NLP models. A comprehensive evaluation on sentiment analysis (SA) and semantic textual similarity (STS) tasks demonstrates that AutoTestForge consistently outperforms existing datasets and testing tools, achieving higher error detection rates (an average of $30.89\%$ for SA and $34.58\%$ for STS). Moreover, different generation strategies exhibit stable effectiveness, with error detection rates ranging from $29.03\%$ to $36.82\%$.
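The label validation and error detection rate mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the strict-majority threshold, the toy keyword models, and the definition of error detection rate as the fraction of validated test cases the model under test gets wrong are all assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Label = str
Model = Callable[[str], Label]  # hypothetical interface: text in, label out


def vote_label(text: str, expected: Label, voters: Sequence[Model],
               min_agreement: float = 0.5) -> bool:
    """Keep a test case only if more than `min_agreement` of the voter
    models predict the expected label (assumed strict-majority rule)."""
    if not voters:
        raise ValueError("at least one voter model is required")
    votes = Counter(model(text) for model in voters)
    return votes[expected] / len(voters) > min_agreement


def error_detection_rate(cases: Sequence[Tuple[str, Label]],
                         model_under_test: Model) -> float:
    """Fraction of test cases on which the model under test disagrees
    with the validated label (assumed definition)."""
    if not cases:
        return 0.0
    failures = sum(1 for text, expected in cases
                   if model_under_test(text) != expected)
    return failures / len(cases)


# Toy voter models standing in for real NLP models (purely illustrative).
voters: List[Model] = [
    lambda t: "positive" if "great" in t else "negative",
    lambda t: "positive" if "great" in t or "good" in t else "negative",
    lambda t: "negative",
]

candidates = [
    ("The food was great.", "positive"),   # 2 of 3 voters agree: kept
    ("The food was great.", "negative"),   # 1 of 3 voters agree: dropped
    ("The service was slow.", "negative"), # 3 of 3 voters agree: kept
]
validated = [(t, y) for t, y in candidates if vote_label(t, y, voters)]

# A trivial model under test that always answers "positive".
rate = error_detection_rate(validated, lambda t: "positive")
print(f"{len(validated)} validated cases, error detection rate = {rate:.2%}")
```

In this sketch the voters only filter candidate test cases; the model under test is scored separately, so a mislabeled generated case is discarded before it can be counted as a detected error.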
