SE, Dec 29, 2021

Syntactic Vs. Semantic similarity of Artificial and Real Faults in Mutation Testing Studies

Milos Ojdanic, Aayush Garg, Ahmed Khanfir, Renzo Degiovanni, Mike Papadakis, Yves Le Traon

arXiv:2112.14508v1, 8.62 citations

Originality: Incremental advance

AI Analysis

This work addresses the realism of fault-seeding techniques in software testing. It shows that faults seeded to look syntactically like real ones may not share the semantic properties of real faults, which matters because test techniques are evaluated against these seeded faults.

The study investigated whether syntactically similar artificial faults in mutation testing reflect semantic similarity to real faults, finding that syntactic similarity does not indicate semantic similarity. Results showed that CodeBERT and PiTest had similar fault detection capabilities (53% and 54% on average), outperforming IBIR (37%) and DeepMutation (7%).

Fault seeding is typically used in controlled studies to evaluate and compare test techniques. Central to these techniques lies the hypothesis that artificially seeded faults involve some form of realistic properties and thus provide realistic experimental results. In an attempt to strengthen realism, a recent line of research uses advanced machine learning techniques, such as deep learning and Natural Language Processing (NLP), to seed faults that look (syntactically) like real ones, implying that fault realism is related to syntactic similarity. This raises the question of whether seeding syntactically similar faults indeed results in semantically similar faults and, more generally, whether syntactically dissimilar faults are far away (semantically) from the real ones. We answer this question by employing four fault-seeding techniques (PiTest, a popular mutation testing tool; IBIR, a tool with manually crafted fault patterns; DeepMutation, a learning-based fault-seeding framework; and CodeBERT, a novel mutation testing tool that uses code embeddings) and demonstrate that syntactic similarity does not reflect semantic similarity. We also show that 60%, 47%, 43%, and 7% of the real faults of Defects4J V2 are semantically resembled by CodeBERT, PiTest, IBIR, and DeepMutation faults, respectively. We then perform an objective comparison between the techniques and find that CodeBERT and PiTest have similar fault detection capabilities that subsume those of IBIR and DeepMutation, and that IBIR is the most cost-effective technique. Moreover, the overall fault detection of PiTest, CodeBERT, IBIR, and DeepMutation was, on average, 54%, 53%, 37%, and 7%, respectively.
