SE LG May 17

A semantic mutation metric for metamorphic relation adequacy in scientific computing programs

arXiv:2605.17437 72.6
AI Analysis

For researchers testing scientific computing software, the Semantic Mutation Score (SMS) offers a backward-compatible adequacy metric that captures domain semantics, though the observed effect is medium rather than large.

The paper proposes a Semantic Mutation Score (SMS) for metamorphic testing in scientific computing, built on five domain-semantic mutation operators. In a 12-program study, the observed Cliff's delta fell in the medium-effect range, short of the pre-registered large-effect threshold. The study also found that three of the five operator classes (Hyperparameter, Structural Injection, and Trajectory Flip) are unreachable under default first-order syntactic mutation.
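As background for the summary above: a classical mutation score is the fraction of seeded faults (mutants) that a test set detects, and in metamorphic testing the test set is a collection of relations between runs rather than a list of expected outputs. The sketch below is a minimal Python illustration under stated assumptions. The `kinetic_energy` program, the four hand-written mutants, and the three relations are hypothetical; they are not the paper's programs, its five operators, or its SMS definition.

```python
import math


def kinetic_energy(m, v):
    """Hypothetical program under test: E = 1/2 * m * v^2."""
    return 0.5 * m * v * v


# Hand-written mutants for illustration only (not the paper's operators).
MUTANTS = {
    "drop_factor": lambda m, v: 0.5 * m * v,           # loses one factor of v
    "abs_factor": lambda m, v: 0.5 * m * v * abs(v),   # breaks sign symmetry
    "coefficient": lambda m, v: 0.49 * m * v * v,      # rescaled constant
    "offset": lambda m, v: 0.5 * m * v * v + 1e-3,     # additive drift
}


def close(a, b):
    return math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)


# Each metamorphic relation compares two runs of the same implementation f.
def mr_double_speed(f, m, v):
    return close(f(m, 2 * v), 4 * f(m, v))


def mr_flip_velocity(f, m, v):
    return close(f(m, -v), f(m, v))


def mr_double_mass(f, m, v):
    return close(f(2 * m, v), 2 * f(m, v))


RELATIONS = [mr_double_speed, mr_flip_velocity, mr_double_mass]
INPUTS = [(1.0, 2.0), (3.0, -1.5), (0.5, 4.0)]  # (mass, velocity) source inputs


def killed(mutant, relations, inputs):
    """A mutant is killed if any relation is violated on any source input."""
    return any(not mr(mutant, m, v) for mr in relations for (m, v) in inputs)


def mutation_score(mutants, relations, inputs):
    """Classical mutation score: killed mutants divided by all mutants."""
    kills = sum(killed(f, relations, inputs) for f in mutants.values())
    return kills / len(mutants)


score = mutation_score(MUTANTS, RELATIONS, INPUTS)
survivors = [name for name, f in MUTANTS.items()
             if not killed(f, RELATIONS, INPUTS)]
print(score, survivors)
```

With all three relations the score is 0.75. The `coefficient` mutant survives because every relation here is a homogeneity or symmetry check, which a constant rescaling cannot violate. Surviving mutants of this kind are what an adequacy metric for a relation set is meant to expose.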

Context. Metamorphic Testing addresses the test-oracle problem in scientific computing, but the classical Mutation Score (MS) operates on syntactic AST mutations and misses domain semantics.

Objective. We propose the Semantic Mutation Score (SMS), built on five domain-semantic operators (Conservation Erosion, Operator Substitution, Hyperparameter, Trajectory Flip, Structural Injection). SMS degenerates almost everywhere to MS in a characterised limit, so any SMS-based conclusion remains consistent with prior mutation-testing literature in the classical regime.

Method. A 12-PUT × 5-MP design over four single-output float-to-float classes (numeric, probabilistic, surrogate, machine-learning) is paired with a three-layer attribution classifier separating true semantic faults from tolerance, OOD, statistical, and artefact categories. A same-source / cross-source ablation under an identical prompt isolates the LLM-source-diversity contribution. LLM-generated mutants are compared against a default-configuration cosmic-ray syntactic pool at the AST-normalised level.

Results. The pre-registered large-effect threshold for Cliff's delta is not met under the point-estimate criterion; the observed effect lies in the medium-effect range. Cross-source pooling under an identical prompt does not appreciably shift delta, indicating that LLM identity is not the lever within this design. AST-level overlap between LLM-generated and default cosmic-ray syntactic mutants is small; the Hyperparameter, Structural Injection, and Trajectory Flip classes are unreachable under default first-order syntactic configurations.

Conclusion. SMS is a backward-compatible adequacy metric for domain-semantic metamorphic-relation sets in scientific computing. The first-order unreachability evidence is independent of the effect-size question.
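Cliff's delta, the effect size named in the Results, is the probability that a value drawn from one sample exceeds a value drawn from the other, minus the reverse probability. A minimal Python sketch follows. The two score lists are invented for illustration and are not the paper's data. The magnitude cutoffs (0.147, 0.33, 0.474) are the commonly cited conventional ones, assumed here because the abstract does not state the pre-registered threshold.

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: (#pairs with x > y - #pairs with x < y) / (|xs| * |ys|)."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


def magnitude(delta):
    """Conventional labels; assumed cutoffs, not taken from the paper."""
    d = abs(delta)
    if d < 0.147:
        return "negligible"
    if d < 0.33:
        return "small"
    if d < 0.474:
        return "medium"
    return "large"


# Made-up adequacy scores for two hypothetical conditions (illustration only).
sample_a = [0.60, 0.70, 0.80, 0.90]
sample_b = [0.50, 0.55, 0.75, 0.85]

delta = cliffs_delta(sample_a, sample_b)
print(delta, magnitude(delta))
```

On this toy data 11 of the 16 pairs favour `sample_a` and 5 favour `sample_b`, so delta is 0.375, which the assumed cutoffs label as medium. Because delta depends only on pairwise ordering, it needs no normality assumption, which suits bounded scores such as mutation scores.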

