42.4SEMay 22Code
Understanding Conversational Patterns in Multi-agent Programming: A Case Study on Fibonacci Game DevelopmentSrijita Basu, Viktor Kjellberg, Simin Sun et al.
Large Language Models (LLMs) are increasingly applied to software engineering (SE), yet their potential for autonomous, role-oriented collaboration remains largely underexplored. Understanding how multiple LLM-based agents coordinate, maintain role alignment, and converge on solutions is critical for SE, as naively allowing agents to interact does not reliably lead to correct or stable outcomes. Recent empirical studies show that unstructured or poorly understood interaction dynamics can result in error propagation, premature consensus on incorrect solutions, or prolonged disagreement that prevents convergence, even when correct partial solutions are present early in the interaction. As an initial step towards addressing this underexplored area, we undertake a systematic analysis of conversations between two agents, a Designer and a Programmer across 12 model combinations from 7 open-source LLMs (Gemma 2, Gemma 3, LLaMA 3.2, LLaMA 3.3, DeepSeek-R1, MiniCPM, and Qwen3). Our systematic approach reveals three key dimensions of multi-agent interaction: efficiency (the speed and stability of convergence), consistency (the degree of role alignment visualized by BLEU and ROUGE), and effectiveness (the extent of compilation success and error resolution). Results show that the DeepSeek-R1:DeepSeek-R1 pair was unique in converging to the correct solution from the very first iteration and sustaining it consistently to the final iteration, while LLaMA 3.2:LLaMA 3.2 and Qwen3:Qwen3 demonstrated strong Designer:Programmer role alignment despite of diverging from the correct solution. The other pairs deviated from the task, never to converge to a result. These findings advance understanding of agentic programming and highlight the need for further research on understanding and calibrating convergence and stop conditions essential for future autonomous SE.
43.7SEMar 24
SE Journals in 2036: Looking Back at the Future We Need to HaveTim Menzies, Paris Avgeriou, Robert Feldt et al.
In 2025, SE publishing faces an existential crisis of scalability. As our communities swell globally and integrate fast-moving methodologies like LLMs, traditional peer-review practices are collapsing under the strain. The "bureaucratic anomaly" of monolithic review has become mathematically unsustainable, creating a stochastic "lottery" that punishes novelty and exhausts researchers. This paper, written from the perspective of 2036, documents potential solutions. Here, the editors of ASE, EMSE, IST, JSS, TOSEM and TSE dream a collective dream of a brighter future. In summary first we stopped fighting (The Journal Alliance). Then we fixed the process (The Lottery / Unbundling / Fixing the Benchmark Graveyard). And then we fixed the culture (Cathedrals/Bazaars).
35.8SEMay 26
Strategies for Guiding LLMs to Use Software Design Patterns: A Case of SingletonViktor Kjellberg, Farnaz Fotrousi, Miroslaw Staron
Large Language Models (LLMs) can generate functional source code from natural-language prompts, but often fail to consistently follow higher-level architectural structures or design patterns. Since LLMs are increasingly used in software engineering, their ability to apply established design principles to generated code is crucial to the long-term success of software products. Therefore, the goal of this paper is to identify strategies for guiding LLMs to incorporate design patterns into the generated source code. We designed a computational experiment to evaluate the ability of 13 LLMs to generate code that follows the Singleton design pattern, using four prompting strategies: instructions, binary automated feedback, extensive automated feedback, and extensive feedback with few-shot prompts, in 164 Java coding challenges from HumanEval-X. Our results shows that the optimal strategy to guide LLMs to include design patterns depends heavily on the type of model. Still, overall, iterative binary feedback provides the best alignment with Singleton while preserving or improving the code's functionality. With guiding with instructions, Llama 3.3 generated Singleton classes in 100% of cases and improved code functionality, increasing the number of tests passed by 34.1 percentage points. It achieved a similar result with guidance through instructions and binary feedback. Qwen 3 (8B) increased the alignment with Singleton to 99.2% and the functionality to 58.6% using binary feedback. Our result suggests that even simple strategies can be used to guide LLMs to use design patterns.
CYAug 8, 2024
Using generative AI to support standardization work -- the case of 3GPPMiroslaw Staron, Jonathan Strom, Albin Karlsson et al.
Standardization processes build upon consensus between partners, which depends on their ability to identify points of disagreement and resolving them. Large standardization organizations, like the 3GPP or ISO, rely on leaders of work packages who can correctly, and efficiently, identify disagreements, discuss them and reach a consensus. This task, however, is effort-, labor-intensive and costly. In this paper, we address the problem of identifying similarities, dissimilarities and discussion points using large language models. In a design science research study, we work with one of the organizations which leads several workgroups in the 3GPP standard. Our goal is to understand how well the language models can support the standardization process in becoming more cost-efficient, faster and more reliable. Our results show that generic models for text summarization correlate well with domain expert's and delegate's assessments (Pearson correlation between 0.66 and 0.98), but that there is a need for domain-specific models to provide better discussion materials for the standardization groups.
22.9CRApr 8
Data Leakage in Automotive Perception: Practitioners' InsightsMd Abu Ahammed Babu, Sushant Kumar Pandey, Darko Durisic et al.
Data leakage is the inadvertent transfer of information between training and evaluation datasets that poses a subtle, yet critical, risk to the reliability of machine learning (ML) models in safety-critical systems such as automotive perception. While leakage is widely recognized in research, little is known about how industrial practitioners actually perceive and manage it in practice. This study investigates practitioners' knowledge, experiences, and mitigation strategies around data leakage through ten semi-structured interviews with system design, development, and verification engineers working on automotive perception functions development. Using reflexive thematic analysis, we identify that knowledge of data leakage is widespread and fragmented along role boundaries: ML engineers conceptualize it as a data-splitting or validation issue, whereas design and verification roles interpret it in terms of representativeness and scenario coverage. Detection commonly arises through generic considerations and observed performance anomalies rather than implying specific tools. However, data leakage prevention is more commonly practiced, which depends mostly on experience and knowledge sharing. These findings suggest that leakage control is a socio-technical coordination problem distributed across roles and workflows. We discuss implications for ML reliability engineering, highlighting the need for shared definitions, traceable data practices, and continuous cross-role communication to institutionalize data leakage awareness within automotive ML development.
SEApr 25, 2019Code
PHANTOM: Curating GitHub for engineered software projects using time-series clusteringPeter Pickerill, Heiko Joshua Jungen, Miroslaw Ochodek et al.
Context: Within the field of Mining Software Repositories, there are numerous methods employed to filter datasets in order to avoid analysing low-quality projects. Unfortunately, the existing filtering methods have not kept up with the growth of existing data sources, such as GitHub, and researchers often rely on quick and dirty techniques to curate datasets. Objective: The objective of this study is to develop a method capable of filtering large quantities of software projects in a resource-efficient way. Method: This study follows the Design Science Research (DSR) methodology. The proposed method, PHANTOM, extracts five measures from Git logs. Each measure is transformed into a time-series, which is represented as a feature vector for clustering using the k-means algorithm. Results: Using the ground truth from a previous study, PHANTOM was shown to be able to rediscover the ground truth on the training dataset, and was able to identify "engineered" projects with up to 0.87 Precision and 0.94 Recall on the validation dataset. PHANTOM downloaded and processed the metadata of 1,786,601 GitHub repositories in 21.5 days using a single personal computer, which is over 33% faster than the previous study which used a computer cluster of 200 nodes. The possibility of applying the method outside of the open-source community was investigated by curating 100 repositories owned by two companies. Conclusions: It is possible to use an unsupervised approach to identify engineered projects. PHANTOM was shown to be competitive compared to the existing supervised approaches while reducing the hardware requirements by two orders of magnitude.
CROct 29, 2024
Improving Image Data Leakage Detection in Automotive SoftwareMd Abu Ahammed Babu, Sushant Kumar Pandey, Darko Durisic et al.
Data leakage is a very common problem that is often overlooked during splitting data into train and test sets before training any ML/DL model. The model performance gets artificially inflated with the presence of data leakage during the evaluation phase which often leads the model to erroneous prediction on real-time deployment. However, detecting the presence of such leakage is challenging, particularly in the object detection context of perception systems where the model needs to be supplied with image data for training. In this study, we conduct a computational experiment on the Cirrus dataset from our industrial partner Volvo Cars to develop a method for detecting data leakage. We then evaluate the method on another public dataset, Kitti, which is a popular and widely accepted benchmark dataset in the automotive domain. The results show that thanks to our proposed method we are able to detect data leakage in the Kitti dataset, which was previously unknown.
LGJan 4, 2022
Trusting Machine Learning Results from Medical Procedures in the Operating RoomAli El-Merhi, Helena Odenstedt Hergés, Linda Block et al.
Machine learning can be used to analyse physiological data for several purposes. Detection of cerebral ischemia is an achievement that would have high impact on patient care. We attempted to study if collection of continous physiological data from non-invasive monitors, and analysis with machine learning could detect cerebral ischemia in tho different setting, during surgery for carotid endarterectomy and during endovascular thrombectomy in acute stroke. We compare the results from the two different group and one patient from each group in details. While results from CEA-patients are consistent, those from thrombectomy patients are not and frequently contain extreme values such as 1.0 in accuracy. We conlcude that this is a result of short duration of the procedure and abundance of data with bad quality resulting in small data sets. These results can therefore not be trusted.
LGDec 20, 2021
Towards Trustworthy Cross-patient Model DevelopmentAli El-Merhi, Helena Odenstedt Hergés, Linda Block et al.
Machine learning is used in medicine to support physicians in examination, diagnosis, and predicting outcomes. One of the most dynamic area is the usage of patient generated health data from intensive care units. The goal of this paper is to demonstrate how we advance cross-patient ML model development by combining the patient's demographics data with their physiological data. We used a population of patients undergoing Carotid Enderarterectomy (CEA), where we studied differences in model performance and explainability when trained for all patients and one patient at a time. The results show that patients' demographics has a large impact on the performance and explainability and thus trustworthiness. We conclude that we can increase trust in ML models in a cross-patient context, by careful selection of models and patients based on their demographics and the surgical procedure.
SEMar 15, 2021
Robust Machine Learning in Critical Care -- Software Engineering and Medical PerspectivesMiroslaw Staron, Helena Odenstedt Hergés, Silvana Naredi et al.
Using machine learning in clinical practice poses hard requirements on explainability, reliability, replicability and robustness of these systems. Therefore, developing reliable software for monitoring critically ill patients requires close collaboration between physicians and software engineers. However, these two different disciplines need to find own research perspectives in order to contribute to both the medical and the software engineering domain. In this paper, we address the problem of how to establish a collaboration where software engineering and medicine meets to design robust machine learning systems to be used in patient care. We describe how we designed software systems for monitoring patients under carotid endarterectomy, in particular focusing on the process of knowledge building in the research team. Our results show what to consider when setting up such a collaboration, how it develops over time and what kind of systems can be constructed based on it. We conclude that the main challenge is to find a good research team, where different competences are committed to a common goal.
SEOct 7, 2020
Empirical Standards for Software Engineering ResearchPaul Ralph, Nauman bin Ali, Sebastian Baltes et al.
Empirical Standards are natural-language models of a scientific community's expectations for a specific kind of study (e.g. a questionnaire survey). The ACM SIGSOFT Paper and Peer Review Quality Initiative generated empirical standards for research methods commonly used in software engineering. These living documents, which should be continuously revised to reflect evolving consensus around research best practices, will improve research quality and make peer review more effective, reliable, transparent and fair.
SEAug 9, 2017
Predicting and Evaluating Software Model Growth in the Automotive IndustryJan Schroeder, Christian Berger, Alessia Knauss et al.
The size of a software artifact influences the software quality and impacts the development process. In industry, when software size exceeds certain thresholds, memory errors accumulate and development tools might not be able to cope anymore, resulting in a lengthy program start up times, failing builds, or memory problems at unpredictable times. Thus, foreseeing critical growth in software modules meets a high demand in industrial practice. Predicting the time when the size grows to the level where maintenance is needed prevents unexpected efforts and helps to spot problematic artifacts before they become critical. Although the amount of prediction approaches in literature is vast, it is unclear how well they fit with prerequisites and expectations from practice. In this paper, we perform an industrial case study at an automotive manufacturer to explore applicability and usability of prediction approaches in practice. In a first step, we collect the most relevant prediction approaches from literature, including both, approaches using statistics and machine learning. Furthermore, we elicit expectations towards predictions from practitioners using a survey and stakeholder workshops. At the same time, we measure software size of 48 software artifacts by mining four years of revision history, resulting in 4,547 data points. In the last step, we assess the applicability of state-of-the-art prediction approaches using the collected data by systematically analyzing how well they fulfill the practitioners' expectations. Our main contribution is a comparison of commonly used prediction approaches in a real world industrial setting while considering stakeholder expectations. We show that the approaches provide significantly different results regarding prediction accuracy and that the statistical approaches fit our data best.