Mohammad Reza Mousavi

SE
h-index: 11
Papers: 10
Citations: 64
Novelty: 37%
AI Score: 50

10 Papers

AI · Jun 22, 2022
On Specifying for Trustworthiness

Dhaminda B. Abeywickrama, Amel Bennaceur, Greg Chance et al.

As autonomous systems (AS) increasingly become part of our daily lives, ensuring their trustworthiness is crucial. In order to demonstrate the trustworthiness of an AS, we first need to specify what is required for an AS to be considered trustworthy. This roadmap paper identifies key challenges for specifying for trustworthiness in AS, as identified during the "Specifying for Trustworthiness" workshop held as part of the UK Research and Innovation (UKRI) Trustworthy Autonomous Systems (TAS) programme. We look across a range of AS domains with consideration of the resilience, trust, functionality, verifiability, security, and governance and regulation of AS and identify some of the key specification challenges in these domains. We then highlight the intellectual challenges that are involved with specifying for trustworthiness in AS that cut across domains and are exacerbated by the inherent uncertainty involved with the environments in which AS need to operate.

SE · May 13
Robust Mutation Analysis of Quantum Programs Under Noise

Sophie Fortz, Eñaut Mendiluze Usandizaga, Shaukat Ali et al.

Mutation analysis has long been used in classical software testing and has recently been adopted for assessing the robustness of quantum software testing techniques. However, existing studies assume ideal, noiseless execution, overlooking the impact of quantum hardware noise. In this paper, we present an empirical study of noise-aware mutation analysis for quantum programs. We analyze how noise affects mutant detection using 41 quantum programs, executed on noiseless and noisy simulators emulating three IBM devices with different noise profiles. We compare several distance metrics and thresholding strategies to evaluate mutant detection under realistic noise. Our results show that noise significantly alters the behavioral distance between programs and mutants, making equivalent mutants harder to distinguish from real faults. Density-matrix metrics achieve the best discrimination, with misclassification rates up to 16.77%, but are not accessible on real hardware. Among practical alternatives, output-distribution metrics reach up to 73.03% accuracy and 74.89% F1-score. Noise-specific thresholds further improve detection compared to noiseless thresholds. We also find that noise effects correlate more with algorithm and circuit characteristics than with mutation types. Overall, our results highlight the need to adapt mutation analysis, and more generally quantum program comparison, to the noise profiles of target quantum devices.
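
The idea of comparing a program and its mutant through an output-distribution metric and a threshold can be sketched as follows. This is only an illustration: the abstract does not name the specific metrics, so total variation distance is used here as one common choice, and the counts and threshold value are hypothetical.

```python
def to_distribution(counts):
    """Normalise raw measurement counts into a probability distribution."""
    total = sum(counts.values())
    return {outcome: n / total for outcome, n in counts.items()}


def total_variation_distance(p, q):
    """Half the L1 distance between two discrete distributions."""
    outcomes = set(p) | set(q)
    return 0.5 * sum(abs(p.get(o, 0.0) - q.get(o, 0.0)) for o in outcomes)


def is_killed(original_counts, mutant_counts, threshold):
    """Flag a mutant as detected when its output distribution lies farther
    from the original's than the threshold (assumed to be chosen per
    noise profile; the value used below is made up)."""
    p = to_distribution(original_counts)
    q = to_distribution(mutant_counts)
    return total_variation_distance(p, q) > threshold


# Hypothetical measurement counts for a 2-qubit program and one mutant.
original = {"00": 500, "11": 500}
mutant = {"00": 250, "01": 250, "10": 250, "11": 250}

print(is_killed(original, mutant, threshold=0.1))
```

Under noise, the original program's own distribution drifts between runs, which is why a threshold calibrated on a noiseless simulator may misclassify equivalent mutants.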

SE · May 13
(How) Do Large Language Models Understand High-Level Message Sequence Charts?

Mohammad Reza Mousavi

Large Language Models (LLMs) are being employed widely to automate tasks across the software development life-cycle. It is, however, unclear whether these tasks are performed consistently with respect to the semantics of the artefacts being handled. This question is particularly under-researched concerning architectural design specification. In this paper, we address this question for High-Level Message Sequence Charts (HMSCs). These are visual models with a rigorous formal semantics that have been used for various purposes, including as a foundation for Sequence Diagrams in the Unified Modelling Language (UML). We examine whether LLMs "understand" the semantics of HMSCs by examining three LLMs (Gemini-3, GPT-5.4, and Qwen-3.6) on how they perform 129 semantic tasks ranging from querying basic semantic constructs in HMSCs (i.e., events and their ordering) to semantic-preserving abstractions and compositions, and calculating the set of traces and trace-equivalent labelled transition systems. The results show that LLMs only have a modest understanding of the formal semantics of HMSCs (ca. 52% overall accuracy), with great variability across different semantic concepts: while LLMs seem to understand the basic semantic concepts of MSCs (ca. 88% accuracy), they struggle with semantic reasoning in tasks involving abstraction and composition (ca. 36% accuracy) and traces and LTSs (ca. 42% accuracy). In particular, all three LLMs struggle with the notions of co-region and explicit causal dependencies and never employed them in semantic-preserving transformations.
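
To make "calculating the set of traces" concrete, the sketch below enumerates the traces of a single basic MSC as the linearizations of its event partial order. It is a simplified illustration, not the paper's task set: the event encoding (`!m` for send, `?m` for receive) and the example chart are assumptions, and HMSC composition, co-regions, and explicit causal dependencies are not modelled.

```python
def linearizations(events, order):
    """Yield every total order of `events` consistent with `order`.

    `order` is a set of (before, after) pairs; it need not be
    transitively closed.
    """
    def extend(prefix, remaining):
        if not remaining:
            yield tuple(prefix)
            return
        for e in sorted(remaining):
            # e is enabled when none of its predecessors is still pending
            if all(a not in remaining for (a, b) in order if b == e):
                yield from extend(prefix + [e], remaining - {e})

    yield from extend([], frozenset(events))


# Hypothetical basic MSC: instance P sends m1 and then m2 to instance Q.
events = {"!m1", "?m1", "!m2", "?m2"}
order = {
    ("!m1", "!m2"),  # order of events on P's lifeline
    ("?m1", "?m2"),  # order of events on Q's lifeline
    ("!m1", "?m1"),  # a message is sent before it is received
    ("!m2", "?m2"),
}

for trace in linearizations(events, order):
    print(" ".join(trace))
```

For this chart the enumeration yields two traces, since the only freedom is whether `?m1` happens before or after `!m2`.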

AI · May 26, 2025 · Code
AMQA: An Adversarial Dataset for Benchmarking Bias of LLMs in Medicine and Healthcare

Ying Xiao, Jie Huang, Ruijuan He et al.

Large language models (LLMs) are reaching expert-level accuracy on medical diagnosis questions, yet their mistakes and the biases behind them pose life-critical risks. Bias linked to race, sex, and socioeconomic status is already well known, but a consistent and automatic testbed for measuring it is missing. To fill this gap, this paper presents AMQA -- an Adversarial Medical Question-Answering dataset -- built for automated, large-scale bias evaluation of LLMs in medical QA. AMQA includes 4,806 medical QA pairs sourced from the United States Medical Licensing Examination (USMLE) dataset, generated using a multi-agent framework to create diverse adversarial descriptions and question pairs. Using AMQA, we benchmark five representative LLMs and find surprisingly substantial disparities: even GPT-4.1, the least biased model tested, answers privileged-group questions over 10 percentage points more accurately than unprivileged ones. Compared with the existing benchmark CPV, AMQA reveals 15% larger accuracy gaps on average between privileged and unprivileged groups. Our dataset and code are publicly available at https://github.com/XY-Showing/AMQA to support reproducible research and advance trustworthy, bias-aware medical AI.

QUANT-PH · Mar 30
Toward Live Noise Fingerprinting in Quantum Software Engineering

Avner Bensoussan, Elena Chachkarova, Karine Even-Mendoza et al.

Contemporary quantum computers are inherently noisy, posing significant challenges for the development and testing of quantum software. Simplified or outdated noise assumptions can lead to incorrect assessments of program correctness, obscure debugging, and hinder cross-platform portability, creating a critical quantum software development gap. Providing accurate, practical noise characterisation is challenging as traditional reconstruction methods scale exponentially and rapidly become outdated. In this vision paper, we address this gap via a novel classical shadow tomography-based pipeline, SIMSHADOW, enabling efficient, continuously updatable noise fingerprinting from empirical observations, suitable for integration into quantum software development workflows, including testing and validation. We prototyped the pipeline to investigate fingerprints' ability to capture structured, interpretable noise and cross-platform discrepancies affecting quantum programs' behaviour to support realistic testing and debugging in future tools. Our evaluation with Qiskit and Cirq under widely used hardware-informed profiles, IBM Boston and Quantinuum H2, shows fingerprints exhibit channel-specific structure and yield interpretable heatmaps. We observed systematic cross-platform discrepancies under matched noise configurations, quantified by large Frobenius distances at a fraction of full tomography cost. On 69 MQTBENCH programs, larger fingerprint differences correlate with output-distribution divergences, highlighting threats for testing and cross-platform debugging tasks.
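
The Frobenius distance used to quantify discrepancies between fingerprints is the square root of the summed squared entry-wise differences of two matrices. The sketch below assumes, purely for illustration, that a fingerprint is a rectangular matrix given as nested lists; the abstract does not specify how fingerprints are represented, and the values are made up.

```python
import math


def frobenius_distance(a, b):
    """Frobenius norm of (a - b) for two equally shaped matrices."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("fingerprints must have the same shape")
    return math.sqrt(
        sum(abs(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    )


# Hypothetical 2x2 fingerprints obtained on two platforms.
platform_a = [[3.0, 0.0], [0.0, 0.0]]
platform_b = [[0.0, 0.0], [0.0, 4.0]]

print(frobenius_distance(platform_a, platform_b))  # sqrt(9 + 16) = 5.0
```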

SE · Nov 3, 2025
The Future of Generative AI in Software Engineering: A Vision from Industry and Academia in the European GENIUS Project

Robin Gröpler, Steffen Klepke, Jack Johns et al.

Generative AI (GenAI) has recently emerged as a groundbreaking force in Software Engineering, capable of generating code, identifying bugs, recommending fixes, and supporting quality assurance. While its use in coding tasks shows considerable promise, applying GenAI across the entire Software Development Life Cycle (SDLC) has not yet been fully explored. Critical uncertainties in areas such as reliability, accountability, security, and data privacy demand deeper investigation and coordinated action. The GENIUS project, comprising over 30 European industrial and academic partners, aims to address these challenges by advancing AI integration across all SDLC phases. It focuses on GenAI's potential, the development of innovative tools, and emerging research challenges, actively shaping the future of software engineering. This vision paper presents a shared perspective on the future of GenAI-driven software engineering, grounded in cross-sector dialogue as well as experiences and findings within the GENIUS consortium. The paper explores four central elements: (1) a structured overview of current challenges in GenAI adoption across the SDLC; (2) a forward-looking vision outlining key technological and methodological advances expected over the next five years; (3) anticipated shifts in the roles and required skill sets of software professionals; and (4) the contribution of GENIUS in realising this transformation through practical tools and industrial validation. This paper focuses on aligning technical innovation with business relevance. It aims to inform both research agendas and industrial strategies, providing a foundation for reliable, scalable, and industry-ready GenAI solutions for software engineering teams.

SE · Apr 5
Benchmarking and Evaluating VLMs for Software Architecture Diagram Understanding

Shuyin Ouyang, Jie M. Zhang, Jingzhi Gong et al.

Software architecture diagrams are important design artifacts for communicating system structure, behavior, and data organization throughout the software development lifecycle. Although recent progress in large language models has substantially advanced code-centric software engineering tasks such as code generation, testing, and maintenance, the ability of modern vision-language models (VLMs) to understand software architecture diagrams remains underexplored. To address this gap, we present SADU, a benchmark for Software Architecture Diagram Understanding that evaluates VLMs on architecture diagrams as structured software engineering artifacts rather than generic images. SADU contains 154 carefully curated diagrams spanning behavioral, structural, and ER diagrams, paired with structured annotations and 2,431 question-answer tasks covering counting and retrieval reasoning. We evaluate 11 state-of-the-art VLMs from the Gemini, Claude, GPT, and Qwen families. Our results show that software architecture diagram understanding remains challenging for current models: the best-performing model, gemini-3-flash-preview, achieves only 70.18% accuracy, while gpt-4o-mini achieves only 17.77% accuracy. The results further reveal weaknesses in diagram reasoning and visual relation grounding, highlighting a gap between current VLMs and the needs of design-stage software engineering. SADU provides a foundation for future research on diagram-aware AI systems and more faithful AI-assisted software engineering workflows.

LG · Apr 23, 2025
Compositional Active Learning of Synchronizing Systems through Automated Alphabet Refinement

Leo Henry, Thomas Neele, Mohammad Reza Mousavi et al.

Active automata learning infers automaton models of systems from behavioral observations, a technique successfully applied to a wide range of domains. Compositional approaches for concurrent systems have recently emerged. We take a significant step beyond available results, including those by the authors, and develop a general technique for compositional learning of a synchronizing parallel system with an unknown decomposition. Our approach automatically refines the global alphabet into component alphabets while learning the component models. We develop a theoretical treatment of distributions of alphabets, i.e., sets of possibly overlapping component alphabets. We characterize counter-examples that reveal inconsistencies with global observations, and show how to systematically update the distribution to restore consistency. We present a compositional learning algorithm implementing these ideas, where learning counterexamples precisely correspond to distribution counterexamples under well-defined conditions. We provide an implementation, called CoalA, using the state-of-the-art active learning library LearnLib. Our experiments show that, in more than 630 subject systems, CoalA delivers orders-of-magnitude improvements (up to five orders) in membership queries; in systems with significant concurrency, it also achieves better scalability in the number of equivalence queries.
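
A distribution of alphabets, i.e., a set of possibly overlapping component alphabets, determines how each component observes a global word: it sees only the projection of that word onto its own alphabet. The sketch below illustrates that projection only. It does not reproduce CoalA or its refinement procedure, and the action names are hypothetical.

```python
def project(word, alphabet):
    """Keep only the actions of `word` that belong to `alphabet`."""
    return tuple(action for action in word if action in alphabet)


def local_views(word, distribution):
    """Project a global word onto every component alphabet."""
    return [project(word, alphabet) for alphabet in distribution]


# Hypothetical distribution: two components that share the action "sync".
distribution = [{"a", "sync"}, {"b", "sync"}]

# Two interleavings of the independent actions "a" and "b" ...
w1 = ("a", "b", "sync")
w2 = ("b", "a", "sync")

# ... give the same local views, so no component can tell them apart.
print(local_views(w1, distribution))
print(local_views(w1, distribution) == local_views(w2, distribution))
```

If the global system accepts `w1` but rejects `w2`, the distribution is inconsistent with the global observations. That is the kind of situation in which the component alphabets would need to be refined.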

SE · Mar 28, 2014
Spinal Test Suites for Software Product Lines

Harsh Beohar, Mohammad Reza Mousavi

A major challenge in testing software product lines is efficiency. In particular, testing a product line should take less effort than testing each and every product individually. We address this issue in the context of input-output conformance testing, which is a formal theory of model-based testing. We extend the notion of conformance testing on input-output featured transition systems with the novel concept of spinal test suites. We show how this concept dispenses with retesting the common behavior among different, but similar, products of a software product line.

SE · Mar 5, 2013
Decomposability in Input Output Conformance Testing

Neda Noroozi, Mohammad Reza Mousavi, Tim A. C. Willemse

We study the problem of deriving a specification for a third-party component, based on the specification of the system and the environment in which the component is supposed to reside. Particularly, we are interested in using component specifications for conformance testing of black-box components, using the theory of input-output conformance (ioco) testing. We propose and prove sufficient criteria for decompositionality, i.e., that components conforming to the derived specification will always compose to produce a correct system with respect to the system specification. We also study the criteria for strong decomposability, by which we can ensure that only those components conforming to the derived specification can lead to a correct system.