4 Papers

SE Mar 16
Making Software Metrics Useful

Ewan Tempero, Paul Ralph

Most engineers use measurements to make decisions. However, measurements are rarely used for decisions about constructing software products. While many approaches to measuring attributes of software ("metrics") have been developed, they are rarely used to answer useful questions such as "Do I need to refactor this class?" or "Are these integration tests sufficient?" Practitioners therefore question the value of software metrics. We argue that this situation arose because software metrics were developed without understanding metrology (the science of measurement) and suggest directions software metrics research should take.

SE May 19
Can LLMs Produce Better Object-Oriented Designs than Human-Involved Development?

Zushuai Zhang, Elliott Wen, Ewan Tempero

Background: Large Language Models (LLMs) are increasingly used for code generation. However, their ability to generate multi-class projects that require object-oriented design (OOD) remains unclear, especially relative to projects developed with human involvement. Aims: The primary objective of this study is to compare OOD quality in projects from three authorship conditions: PreAI (human-involved projects produced before widespread LLM use), PostAI (human-involved projects produced after widespread LLM use), and PureAI (projects generated end-to-end by contemporary LLMs). Method: We conducted a comparative case study on a postgraduate Java assignment. Two offerings of the same assignment were selected as the PreAI and PostAI datasets. PureAI projects were generated using three contemporary LLMs. We analyzed OOD quality using project-level OOD metrics, code smell density, and domain modeling. Results: Relative to human-involved projects, PureAI projects show lower code smell density and generally appear simpler in terms of total size, complexity, and coupling. However, this is consistent with oversimplification, as it is associated with missing abstractions and weaker responsibility separation. PostAI is closer to PureAI than PreAI on many OOD measures and also shows tendencies toward oversimplification. Conclusions: Our findings indicate that appropriate human guidance on object-oriented decomposition and responsibility assignment remains important when LLMs are used for object-oriented design.

DC Oct 30, 2025
Mind the Gap: Revealing Inconsistencies Across Heterogeneous AI Accelerators

Elliott Wen, Sean Ma, Ewan Tempero et al.

While NVIDIA remains the dominant provider of AI accelerators within cloud data centers, emerging vendors such as AMD, Intel, Mac, and Huawei offer cost-effective alternatives with claims of compatibility and performance. This paper presents the first empirical study investigating divergence in machine learning model behavior across heterogeneous AI accelerators. Utilizing an automated pipeline, we synthesize over 100,000 variant models derived from 4,000 real-world models and execute them across five different enterprise-grade accelerators. Our findings suggest that newer AI platforms from Mac and Huawei support at least 17% fewer operators than NVIDIA. These platforms also exhibit a higher rate of output discrepancies (exceeding 5%), which stem from differences in operator implementations, handling of exceptional numerical values, and instruction scheduling. They are also more susceptible to failures during model compilation-based acceleration, and in some cases, the compiled models produce outputs that differ noticeably from those generated using the standard execution mode. In addition, we identify 7 implementation flaws in PyTorch and 40 platform-specific issues across vendors. These results underscore the challenges of achieving consistent machine learning behavior in an increasingly diverse hardware ecosystem.

SE Dec 31, 2020
Consolidating a Model for Describing Situated Software Practices

Diana Kirk, Stephen G. MacDonell, Ewan Tempero

Many prescriptive approaches to developing software-intensive systems have been advocated, but each is based on assumptions about context. It has been found that practitioners do not follow prescribed methodologies, but rather select and adapt specific practices according to local needs. As researchers, we would like to be in a position to support such tailoring. However, at the present time we simply do not have sufficient evidence relating practice and context for this to be possible. We have long understood that a deeper understanding of situated software practices is crucial for progress in this area, and have been exploring this problem from a number of perspectives. In this position paper, we draw together the various aspects of our work into a holistic model and discuss the ways in which the model might be applied to support the long-term goal of evidence-based decision support for practitioners. The contribution specific to this paper is a discussion of model evaluation, including a proof-of-concept demonstration of model utility. We map Kernel elements from the Essence system to our model and discuss gaps and limitations exposed in the Kernel. Finally, we outline our plans for further refining and evaluating the model.