SE Oct 30, 2023
Can ChatGPT advance software testing intelligence? An experience report on metamorphic testing
Quang-Hung Luu, Huai Liu, Tsong Yueh Chen
While ChatGPT is a well-known artificial intelligence chatbot used to answer human questions, one may want to discover its potential in advancing software testing. We examine the capability of ChatGPT in advancing the intelligence of software testing through a case study on metamorphic testing (MT), a state-of-the-art software testing technique. We ask ChatGPT to generate candidate metamorphic relations (MRs), which are necessary properties of the object program and which traditionally require human intelligence to identify. These MR candidates are then evaluated for correctness by domain experts. We show that ChatGPT can be used to generate new correct MRs to test several software systems. Having said that, the majority of MR candidates are either vaguely defined or incorrect, especially for systems that have never been tested with MT. ChatGPT can be used to advance software testing intelligence by proposing MR candidates that can later be adopted for implementing tests, but human intelligence is still needed to justify and rectify their correctness.
SE May 18
LLM-Based Static Verification of Code Against Natural-Language Requirements: An Industrial Experience Report
Zhi Quan Zhou, Dave Towey, Tsong Yueh Chen
Large language models (LLMs) are increasingly used to generate requirements specifications, design documents, code, and test cases. In contrast, much less attention has been given to a more difficult assurance problem: statically verifying whether implemented code satisfies requirements written in natural language. Conventional static analysis tools are effective at detecting coding defects and known vulnerability patterns, but they cannot determine whether program behavior matches intended business logic. Detecting such defects requires reasoning over the specification rather than the code alone. Software testing can expose some of these mismatches, but its effectiveness depends heavily on test design, executable artifacts, and runtime environments. This article presents a two-stage LLM-based workflow for addressing this challenge in an intelligent-vehicle cybersecurity case study. In the first stage, an AI-based rule miner extracts verifiable rules from natural-language requirements while explicitly identifying ambiguity, self-contradiction, and other non-verifiable statements. In the second stage, an AI-based code auditor checks implementation evidence against the extracted rules. Instead of asking a single LLM to directly verify code against lengthy natural-language specifications, the workflow introduces a structured intermediate representation to reduce hallucination, output variability, limited explainability, and context loss. The resulting approach is a requirement-aware and semantics-aware form of static analysis that complements software testing. By analyzing requirements and source code without requiring compilation, execution, or runtime environments, the method shifts verification and validation activities left in the development lifecycle. This LLM-based static analysis is also a new approach to addressing the test oracle problem.
SE Apr 20
From Particles to Perils: SVGD-Based Hazardous Scenario Generation for Autonomous Driving Systems Testing
Linfeng Liang, Xiao Cheng, Tsong Yueh Chen et al.
Simulation-based testing of autonomous driving systems (ADS) must uncover realistic and diverse failures in dense, heterogeneous traffic. However, existing search-based seeding methods (e.g., genetic algorithms) struggle in high-dimensional spaces, often collapsing to limited modes and missing many failure scenarios. We present PtoP, a framework that combines adaptive random seed generation with Stein Variational Gradient Descent (SVGD) to produce diverse, failure-inducing initial conditions. SVGD balances attraction toward high-risk regions and repulsion among particles, yielding risk-seeking yet well-distributed seeds across multiple failure modes. PtoP is plug-and-play and enhances existing online testing methods (e.g., reinforcement learning-based testers) by providing principled seeds. Evaluation in CARLA on two industry-grade ADS (Apollo, Autoware) and a native end-to-end system shows that PtoP improves safety violation rate (up to 27.68%), scenario diversity (9.6%), and map coverage (16.78%) over baselines.
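The attraction/repulsion balance described in this abstract can be illustrated with a minimal one-dimensional SVGD sketch. This is not the PtoP implementation: the target density (a standard normal standing in for a "risk" landscape), the RBF kernel with fixed bandwidth `h`, the step size, and all function names are assumptions chosen for illustration only.

```python
import math


def rbf(a, b, h):
    """RBF kernel k(a, b) with fixed bandwidth h (an assumed choice)."""
    return math.exp(-((a - b) ** 2) / (2.0 * h * h))


def svgd_step(particles, grad_log_p, h=1.0, step=0.1):
    """One SVGD update for scalar particles.

    For each particle x_i the update direction is the average over x_j of
      k(x_j, x_i) * grad_log_p(x_j)      (attraction toward high density)
    + d/dx_j k(x_j, x_i)                 (repulsion between particles)
    """
    n = len(particles)
    updated = []
    for xi in particles:
        phi = 0.0
        for xj in particles:
            k = rbf(xj, xi, h)
            attraction = k * grad_log_p(xj)
            repulsion = k * (xi - xj) / (h * h)  # gradient of k w.r.t. xj
            phi += attraction + repulsion
        updated.append(xi + step * phi / n)
    return updated


def grad_log_standard_normal(x):
    """Gradient of log N(0, 1); a toy stand-in for a risk landscape."""
    return -x


def run_svgd(n_particles=20, n_steps=500):
    # Start all particles far from the high-density region, evenly spaced in [3, 5].
    particles = [3.0 + 2.0 * i / (n_particles - 1) for i in range(n_particles)]
    for _ in range(n_steps):
        particles = svgd_step(particles, grad_log_standard_normal)
    return particles


particles = run_svgd()
```

In this toy setting the particles drift toward the mode of the target while the kernel-gradient term keeps them from collapsing onto a single point, which is the behaviour the abstract attributes to SVGD.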
SE May 12
Bidirectional Empowerment of Metamorphic Testing and Large Language Models: A Systematic Survey
Zheng Zheng, Zenghui Zhou, Yinwang Xu et al.
Large language models (LLMs) have introduced substantial challenges to software quality assurance due to their generative, probabilistic, and open-ended nature, which intensifies the oracle problem and limits the applicability of traditional testing methods. Metamorphic testing (MT), which checks necessary relations among multiple related executions rather than relying on exact expected outputs, has emerged as a promising approach for testing LLMs and other oracle-deficient systems. At the same time, the strong semantic understanding, reasoning, and code generation capabilities of LLMs create new opportunities to automate the traditionally labor-intensive phases of MT. This survey systematically reviews 93 primary studies and characterizes this reciprocal relationship as the bidirectional empowerment of MT and LLMs. We propose a taxonomy spanning two complementary directions: MT for LLMs, which uses MT to verify, validate, assess, and understand LLMs and LLM-based systems across issues such as hallucination, fairness, robustness, code reliability, retrieval-augmented generation, dialogue, and autonomous agents; and LLMs for MT, which leverages LLMs to support metamorphic relation discovery, input transformation and synthesis, executable test implementation, and agentic closed-loop testing. By synthesizing these developments, this survey provides a structured foundation for understanding the evolving synergy between MT and LLMs and highlights future directions for building more rigorous, scalable, and trustworthy AI quality assurance methodologies.
SE Dec 20, 2024
MORTAR: Multi-turn Metamorphic Testing for LLM-based Dialogue Systems
Guoxiang Guo, Aldeida Aleti, Neelofar Neelofar et al.
With the widespread application of LLM-based dialogue systems in daily life, quality assurance has become more important than ever. Recent research has successfully introduced methods to identify unexpected behaviour in single-turn testing scenarios. However, multi-turn interaction is the common real-world usage of dialogue systems, yet testing methods for such interactions remain underexplored. This is largely due to the oracle problem in multi-turn testing, which continues to pose a significant challenge for dialogue system developers and researchers. In this paper, we propose MORTAR, a metamorphic multi-turn dialogue testing approach, which mitigates the test oracle problem in testing LLM-based dialogue systems. MORTAR formalises the multi-turn testing for dialogue systems, and automates the generation of question-answer dialogue test cases with multiple dialogue-level perturbations and metamorphic relations (MRs). The automated MR matching mechanism allows MORTAR more flexibility and efficiency in metamorphic testing. The proposed approach is fully automated without reliance on LLM judges. In testing six popular LLM-based dialogue systems, MORTAR reaches significantly better effectiveness with over 150% more bugs revealed per test case when compared to the single-turn metamorphic testing baseline. Regarding the quality of bugs, MORTAR reveals higher-quality bugs in terms of diversity, precision and uniqueness. MORTAR is expected to inspire more multi-turn testing approaches, and assist developers in evaluating the dialogue system performance more comprehensively with constrained test resources and budget.
SE Mar 28, 2025
Integrating Artificial Intelligence with Human Expertise: An In-depth Analysis of ChatGPT's Capabilities in Generating Metamorphic Relations
Yifan Zhang, Dave Towey, Matthew Pike et al.
Context: This paper provides an in-depth examination of the generation and evaluation of Metamorphic Relations (MRs) using GPT models developed by OpenAI, with a particular focus on the capabilities of GPT-4 in software testing environments. Objective: The aim is to examine the quality of MRs produced by GPT-3.5 and GPT-4 for a specific System Under Test (SUT) adopted from an earlier study, and to introduce and apply an improved set of evaluation criteria for a diverse range of SUTs. Method: The initial phase evaluates MRs generated by GPT-3.5 and GPT-4 using criteria from a prior study, followed by an application of an enhanced evaluation framework on MRs created by GPT-4 for a diverse range of nine SUTs, varying from simple programs to complex systems incorporating AI/ML components. A custom-built GPT evaluator, alongside human evaluators, assessed the MRs, enabling a direct comparison between automated and human evaluation methods. Results: The study finds that GPT-4 outperforms GPT-3.5 in generating accurate and useful MRs. With the advanced evaluation criteria, GPT-4 demonstrates a significant ability to produce high-quality MRs across a wide range of SUTs, including complex systems incorporating AI/ML components. Conclusions: GPT-4 exhibits advanced capabilities in generating MRs suitable for various applications. The research underscores the growing potential of AI in software testing, particularly in the generation and evaluation of MRs, and points towards the complementarity of human and AI skills in this domain.
ME Aug 17, 2021
Testing Multiple Linear Regression Systems with Metamorphic Testing
Quang-Hung Luu, Man F. Lau, Sebastian P. H. Ng et al.
Regression is one of the most commonly used statistical techniques. However, testing regression systems is a great challenge because of the general absence of a test oracle. In this paper, we show that Metamorphic Testing is an effective approach to test multiple linear regression systems. In doing so, we identify intrinsic mathematical properties of linear regression, and then propose 11 Metamorphic Relations to be used for testing. Their effectiveness is examined using mutation analysis with a range of different regression programs. We further look at how the testing could be adopted in a more effective way. Our work is applicable to examine the reliability of predictive systems based on regression that has been widely used in economics, engineering and science, as well as of the regression calculation manipulated by statistical users.
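To make the idea of deriving MRs from mathematical properties of linear regression concrete, here is a small sketch for the single-predictor case. The two relations used (scaling the response scales both coefficients; shifting the response shifts only the intercept) are standard properties of ordinary least squares and are chosen for illustration; they are not claimed to be among the paper's 11 MRs. The function names and the sample data are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one predictor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def close(a, b):
    return math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-9)


def mr_scale_response(fit, xs, ys, c):
    """MR: multiplying every y by c multiplies intercept and slope by c."""
    b0, b1 = fit(xs, ys)
    f0, f1 = fit(xs, [c * y for y in ys])
    return close(f0, c * b0) and close(f1, c * b1)


def mr_shift_response(fit, xs, ys, d):
    """MR: adding d to every y adds d to the intercept; slope is unchanged."""
    b0, b1 = fit(xs, ys)
    f0, f1 = fit(xs, [y + d for y in ys])
    return close(f0, b0 + d) and close(f1, b1)


# Hypothetical source test case; no expected coefficients are needed.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]
scale_ok = mr_scale_response(fit_line, xs, ys, 3.0)
shift_ok = mr_shift_response(fit_line, xs, ys, -7.5)
```

Neither check needs the "true" coefficients for the data, which is how relations of this kind sidestep the missing oracle.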
SE Aug 5, 2021
Using Metamorphic Relations to Verify and Enhance Artcode Classification
Liming Xu, Dave Towey, Andrew French et al.
Software testing is often hindered where it is impossible or impractical to determine the correctness of the behaviour or output of the software under test (SUT), a situation known as the oracle problem. An example of an area facing the oracle problem is automatic image classification, using machine learning to classify an input image as one of a set of predefined classes. An approach to software testing that alleviates the oracle problem is metamorphic testing (MT). While traditional software testing examines the correctness of individual test cases, MT instead examines the relations amongst multiple executions of test cases and their outputs. These relations are called metamorphic relations (MRs): if an MR is found to be violated, then a fault must exist in the SUT. This paper examines the problem of classifying images containing visually hidden markers called Artcodes, and applies MT to verify and enhance the trained classifiers. This paper further examines two MRs, Separation and Occlusion, and reports on their capability in verifying the image classification using one-way analysis of variance (ANOVA) in conjunction with three other statistical analysis methods: t-test (for unequal variances), Kruskal-Wallis test, and Dunnett's test. In addition to our previously-studied classifier, that used Random Forests, we introduce a new classifier that uses a support vector machine, and present its MR-augmented version. Experimental evaluations across a number of performance metrics show that the augmented classifiers can achieve better performance than non-augmented classifiers. This paper also analyses how the enhanced performance is obtained.
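The core mechanism described here, checking a relation across multiple executions instead of checking each output against an oracle, can be sketched with a textbook example unrelated to Artcodes: the identity sin(x) = sin(pi - x). The checker, the deliberately faulty `buggy_sin` (a truncated Taylor series), and the chosen inputs are all hypothetical illustrations.

```python
import math


def check_mr(sine, source_inputs, tol=1e-9):
    """Return the source inputs for which sine(x) != sine(pi - x).

    Each check runs the program twice (source and follow-up test case)
    and compares the two outputs with each other, not with a known answer.
    """
    violations = []
    for x in source_inputs:
        follow_up = math.pi - x
        if abs(sine(x) - sine(follow_up)) > tol:
            violations.append(x)
    return violations


def buggy_sin(x):
    """Faulty implementation: Taylor series truncated after three terms."""
    return x - x ** 3 / 6.0 + x ** 5 / 120.0


inputs = [0.1, 0.5, 1.0]
reference_violations = check_mr(math.sin, inputs)
buggy_violations = check_mr(buggy_sin, inputs)
```

The library `math.sin` satisfies the relation on these inputs, while the truncated series violates it because it is inaccurate for the larger follow-up arguments; the violation reveals the fault without anyone supplying the expected sine values.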
LG Apr 10, 2021
Use of Metamorphic Relations as Knowledge Carriers to Train Deep Neural Networks
Tsong Yueh Chen, Pak-Lok Poon, Kun Qiu et al.
Training multiple-layered deep neural networks (DNNs) is difficult. The standard practice of using a large number of samples for training often does not improve the performance of a DNN to a satisfactory level. Thus, a systematic training approach is needed. To address this need, we introduce an innovative approach of using metamorphic relations (MRs) as "knowledge carriers" to train DNNs. Based on the concept of metamorphic testing and MRs (which play the role of a test oracle in software testing), we make use of the notion of metamorphic group of inputs as concrete instances of MRs (which are abstractions of knowledge) to train a DNN in a systematic and effective manner. To verify the viability of our training approach, we have conducted a preliminary experiment to compare the performance of two DNNs: one trained with MRs and the other trained without MRs. We found that the DNN trained with MRs has delivered a better performance, thereby confirming that our approach of using MRs as knowledge carriers to train DNNs is promising. More work and studies, however, are needed to solidify and leverage this approach to generate widespread impact on effective DNN training.
SE Dec 19, 2020
A Declarative Metamorphic Testing Framework for Autonomous Driving
Yao Deng, Xi Zheng, Tianyi Zhang et al.
Autonomous driving has gained much attention from both industry and academia. Currently, Deep Neural Networks (DNNs) are widely used for perception and control in autonomous driving. However, several fatal accidents caused by autonomous vehicles have raised serious safety concerns about autonomous driving models. Some recent studies have successfully used the metamorphic testing technique to detect thousands of potential issues in some popularly used autonomous driving models. However, prior study is limited to a small set of metamorphic relations, which do not reflect rich, real-world traffic scenarios and are also not customizable. This paper presents a novel declarative rule-based metamorphic testing framework called RMT. RMT provides a rule template with natural language syntax, allowing users to flexibly specify an enriched set of testing scenarios based on real-world traffic rules and domain knowledge. RMT automatically parses human-written rules to metamorphic relations using an NLP-based rule parser referring to an ontology list and generates test cases with a variety of image transformation engines. We evaluated RMT on three autonomous driving models. With an enriched set of metamorphic relations, RMT detected a significant number of abnormal model predictions that were not detected by prior work. Through a large-scale human study on Amazon Mechanical Turk, we further confirmed the authenticity of test cases generated by RMT and the validity of detected abnormal model predictions.
SE Jul 30, 2020
Identification of Failure Regions for Programs with Numeric Inputs
Rubing Huang, Weifeng Sun, Tsong Yueh Chen et al.
The failure region, where failure-causing inputs reside, has provided many insights for enhancing the effectiveness of many testing methods. The failure region may also provide important information to support other processes such as software debugging. When a testing method detects a software failure, indicating that a failure-causing input has been identified, the next important question is how to identify the failure region based on this failure-causing input, i.e., Identification of Failure Regions (IFR). In this paper, we introduce a new IFR strategy, namely Search for Boundary (SB), to identify an approximate failure region of a numeric input domain. SB attempts to identify additional failure-causing inputs that are as close to the boundary of the failure region as possible. To support SB, we provide a basic procedure, and then propose two methods, namely Fixed-orientation Search for Boundary (FSB) and Diverse-orientation Search for Boundary (DSB). In addition, we implemented an automated experimentation platform to integrate these methods. In the experiments, we evaluated the proposed SB methods using a series of simulation studies and empirical studies with different types of failure regions. The results show that our methods can effectively identify a failure region within limited testing resources.
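As a rough illustration of what searching for a failure-region boundary from one known failure-causing input could look like, the sketch below walks outward along a single fixed direction and then bisects between the last failing and first passing point. This is only an assumed reading of the general idea; it is not the paper's SB, FSB, or DSB procedure, and the circular failure region, the function names, and the parameters are hypothetical.

```python
def is_failure(point):
    """Hypothetical program behaviour: inputs inside the unit circle fail."""
    x, y = point
    return x * x + y * y <= 1.0


def search_boundary(fails, start, direction, step=0.5, tol=1e-6, max_expand=50):
    """Find a failure-causing input close to the boundary along one direction.

    start      -- a known failure-causing input
    direction  -- (dx, dy) orientation to search along
    Returns the failing point nearest the boundary found, or None if no
    passing input is reached within max_expand doublings.
    """
    def at(t):
        return (start[0] + t * direction[0], start[1] + t * direction[1])

    lo, hi = 0.0, step
    # Phase 1: expand outward until a passing input is found.
    for _ in range(max_expand):
        if not fails(at(hi)):
            break
        lo, hi = hi, hi * 2.0
    else:
        return None

    # Phase 2: bisect between the failing (lo) and passing (hi) distances.
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if fails(at(mid)):
            lo = mid
        else:
            hi = mid
    return at(lo)


boundary_point = search_boundary(is_failure, start=(0.0, 0.0), direction=(1.0, 0.0))
```

Repeating the search along several orientations would yield a set of near-boundary failing inputs from which an approximate region could be outlined; how orientations are chosen is exactly where fixed and diverse strategies would differ.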
SE Jul 27, 2018
METTLE: a METamorphic testing approach to assessing and validating unsupervised machine LEarning systems
Xiaoyuan Xie, Zhiyi Zhang, Tsong Yueh Chen et al.
Unsupervised machine learning is the training of an artificial intelligence system using information that is neither classified nor labeled, with a view to modeling the underlying structure or distribution in a dataset. Since unsupervised machine learning systems are widely used in many real-world applications, assessing the appropriateness of these systems and validating their implementations with respect to individual users' requirements and specific application scenarios/contexts are indisputably two important tasks. Such assessment and validation tasks, however, are fairly challenging due to the absence of a priori knowledge of the data. In view of this challenge, we develop a METamorphic Testing approach to assessing and validating unsupervised machine LEarning systems, abbreviated as METTLE. Our approach provides a new way to unveil the (possibly latent) characteristics of various machine learning systems, by explicitly considering the specific expectations and requirements of these systems from individual users' perspectives. To support METTLE, we have further formulated 11 generic metamorphic relations (MRs), covering users' generally expected characteristics that should be possessed by machine learning systems. To demonstrate the viability and effectiveness of METTLE we have performed an experiment involving six commonly used clustering systems. Our experiment has shown that, guided by user-defined MR-based adequacy criteria, end users are able to assess, validate, and select appropriate clustering systems in accordance with their own specific needs. Our investigation has also yielded insightful understanding and interpretation of the behavior of the machine learning systems from an end-user software engineering's perspective, rather than a designer's or implementor's perspective, who normally adopts a theoretical approach.