87.2SEApr 7Code
Beyond Functional Correctness: Design Issues in AI IDE-Generated Large-Scale ProjectsSyed Mohammad Kashif, Ruiyin Li, Peng Liang et al.
New generation of AI coding tools, including AI-powered IDEs equipped with agentic capabilities, can generate code within the context of the project. These AI IDEs are increasingly perceived as capable of producing project-level code at scale. However, there is limited empirical evidence on the extent to which they can generate large-scale software systems and what design issues such systems may exhibit. To address this gap, we conducted a study to explore the capability of Cursor in generating large-scale projects and to evaluate the design quality of projects generated by Cursor. First, we propose a Feature-Driven Human-In-The-Loop (FD-HITL) framework that systematically guides project generation from curated project descriptions. We generated 10 projects using Cursor with the FD-HITL framework across three application domains and multiple technologies. We assessed the functional correctness of these projects through manual evaluation, obtaining an average functional correctness score of 91%. Next, we analyzed the generated projects using two static analysis tools, CodeScene and SonarQube, to detect design issues. We identified 1,305 design issues categorized into 9 categories by CodeScene and 3,193 issues in 11 categories by SonarQube. Our findings show that (1) when used with the FD-HITL framework, Cursor can generate functional large-scale projects averaging 16,965 LoC and 114 files; (2) the generated projects nevertheless contain design issues that may pose long-term maintainability and evolvability risks, requiring careful review by experienced developers; (3) the most prevalent issues include Code Duplication, high Code Complexity, Large Methods, Framework Best-Practice Violations, Exception-Handling Issues and Accessibility Issues; (4) these design issues violate design principles such as SRP, SoC, and DRY. The replication package is at https://github.com/Kashifraz/DIinAGP
SEDec 28, 2025Code
FasterPy: An LLM-based Code Execution Efficiency Optimization FrameworkYue Wu, Minghao Han, Ruiyin Li et al.
Code often suffers from performance bugs. These bugs necessitate the research and practice of code optimization. Traditional rule-based methods rely on manually designing and maintaining rules for specific performance bugs (e.g., redundant loops, repeated computations), making them labor-intensive and limited in applicability. In recent years, machine learning and deep learning-based methods have emerged as promising alternatives by learning optimization heuristics from annotated code corpora and performance measurements. However, these approaches usually depend on specific program representations and meticulously crafted training datasets, making them costly to develop and difficult to scale. With the booming of Large Language Models (LLMs), their remarkable capabilities in code generation have opened new avenues for automated code optimization. In this work, we proposed FasterPy, a low-cost and efficient framework that adapts LLMs to optimize the execution efficiency of Python code. FasterPy combines Retrieval-Augmented Generation (RAG), supported by a knowledge base constructed from existing performance-improving code pairs and corresponding performance measurements, with Low-Rank Adaptation (LoRA) to enhance code optimization performance. Our experimental results on the Performance Improving Code Edits (PIE) benchmark demonstrate that our method outperforms existing models on multiple metrics. The FasterPy tool and the experimental results are available at https://github.com/WuYue22/fasterpy.
92.6SEMay 31
Bridging Requirements and Architecture: Multi-Agent Orchestration with External Knowledge and Hierarchical MemoryRuiyin Li, Yiran Zhang, Xiyu Zhou et al.
Software architecture design is a critical yet inherently complex and knowledge-intensive phase that requires balancing competing quality attributes and adapting to evolving requirements. Traditionally, this process has been time-consuming, labor-intensive, and heavily reliant on architects, often resulting in limited exploration of alternative architectural decompositions and styles, especially under the pressures of agile development. While LLM-based agents have shown promising performance across various software engineering tasks, their application to architecture design remains relatively scarce and requires systematic exploration. To address these challenges, we proposed MAAD (Multi-Agent Architecture Design), a knowledge-driven framework that orchestrates four specialized agents (i.e., Analyst, Modeler, Designer and Evaluator) to autonomously and collaboratively transform requirements specifications into comprehensive, multi-view architectural blueprints with quality attribute assessments. MAAD incorporates RAG to inject recognized architectural standards and patterns into the workflow and leverages a hierarchical memory mechanism that captures design history for iterative refinement. We evaluated MAAD through comparative experiments against MetaGPT, using quantitative architecture-level metrics across 10 case studies and qualitative feedback from industry architects on 10 real-world specifications. Results show that MAAD generates more complete, modular, and traceable architectures than the baseline, and its dedicated Evaluator agent autonomously produces structured quality evaluation reports that significantly reduce manual validation efforts. Furthermore, we found that the quality of the generated architecture heavily depends on the underlying LLM's reasoning capacity, with GPT-5.2 and Qwen3.5 outperforming other models across most evaluation settings.
83.0SEMay 2Code
Using LLMs in Software Design: An Empirical Study of GitHub and A Practitioner SurveyYifei Wang, Ruiyin Li, Peng Liang et al.
Recent advancements in Large Language Models (LLMs) have demonstrated significant potential across a wide range of software engineering tasks, including software design, an area traditionally regarded as highly dependent on human expertise and judgment. However, there has been little research focusing on how LLMs are used in software design, nor on the associated benefits and drawbacks. This paper aims to bridge this gap by empirically investigating how software developers utilize LLMs in the context of software design. We conduct a mixed-methods study, combining a mining study of 291 developer-ChatGPT conversations shared on GitHub with a survey of 65 software practitioners. Our findings reveal nine distinct categories of design tasks supported by ChatGPT, including architecture design, data model design, and the use of design patterns. We further characterize developer-ChatGPT interactions, showing that developers primarily use ChatGPT for knowledge acquisition and design-related code generation, with most tasks situated at the detailed design level. The study identifies seven key benefits of utilizing LLMs in software design as perceived by developers, such as better technology selection and the early detection of design flaws. We also uncover six limitations, including the generation of overly lengthy and difficult-to-read outputs, the creation of inexecutable or incorrect code, and a heavy reliance on context that can lead to hallucinated results. These findings provide an evidence-based characterization of current LLM use in software design from both open-source and practitioner perspectives, highlighting a tension between perceived benefits and limitations, which lays a foundation for future research and the development of effective techniques and tools to integrate LLMs into software design practices.
SENov 11, 2025
Designing LLM-based Multi-Agent Systems for Software Engineering Tasks: Quality Attributes, Design Patterns and RationaleYangxiao Cai, Ruiyin Li, Peng Liang et al.
As the complexity of Software Engineering (SE) tasks continues to escalate, Multi-Agent Systems (MASs) have emerged as a focal point of research and practice due to their autonomy and scalability. Furthermore, through leveraging the reasoning and planning capabilities of Large Language Models (LLMs), the application of LLM-based MASs in the field of SE is garnering increasing attention. However, there is no dedicated study that systematically explores the design of LLM-based MASs, including the Quality Attributes (QAs) on which the designers mainly focus, the design patterns used by the designers, and the rationale guiding the design of LLM-based MASs for SE tasks. To this end, we conducted a study to identify the QAs that LLM-based MASs for SE tasks focus on, the design patterns used in the MASs, and the design rationale for the MASs. We collected 94 papers on LLM-based MASs for SE tasks as the source. Our study shows that: (1) Code Generation is the most common SE task solved by LLM-based MASs among ten identified SE tasks, (2) Functional Suitability is the QA on which designers of LLM-based MASs pay the most attention, (3) Role-Based Cooperation is the design pattern most frequently employed among 16 patterns used to construct LLM-based MASs, and (4) Improving the Quality of Generated Code is the most common rationale behind the design of LLM-based MASs. Based on the study results, we presented the implications for the design of LLM-based MASs to support SE tasks.
SEOct 24, 2025Code
ArchISMiner: A Framework for Automatic Mining of Architectural Issue-Solution Pairs from Online Developer CommunitiesMusengamana Jean de Dieu, Ruiyin Li, Peng Liang et al.
Stack Overflow (SO), a leading online community forum, is a rich source of software development knowledge. However, locating architectural knowledge, such as architectural solutions remains challenging due to the overwhelming volume of unstructured content and fragmented discussions. Developers must manually sift through posts to find relevant architectural insights, which is time-consuming and error-prone. This study introduces ArchISMiner, a framework for mining architectural knowledge from SO. The framework comprises two complementary components: ArchPI and ArchISPE. ArchPI trains and evaluates multiple models, including conventional ML/DL models, Pre-trained Language Models (PLMs), and Large Language Models (LLMs), and selects the best-performing model to automatically identify Architecture-Related Posts (ARPs) among programming-related discussions. ArchISPE employs an indirect supervised approach that leverages diverse features, including BERT embeddings and local TextCNN features, to extract architectural issue-solution pairs. Our evaluation shows that the best model in ArchPI achieves an F1-score of 0.960 in ARP detection, and ArchISPE outperforms baselines in both SE and NLP fields, achieving F1-scores of 0.883 for architectural issues and 0.894 for solutions. A user study further validated the quality (e.g., relevance and usefulness) of the identified ARPs and the extracted issue-solution pairs. Moreover, we applied ArchISMiner to three additional forums, releasing a dataset of over 18K architectural issue-solution pairs. Overall, ArchISMiner can help architects and developers identify ARPs and extract succinct, relevant, and useful architectural knowledge from developer communities more accurately and efficiently. The replication package of this study has been provided at https://github.com/JeanMusenga/ArchISPE
SEJan 4, 2022Code
Symptoms of Architecture Erosion in Code Reviews: A Study of Two OpenStack ProjectsRuiyin Li, Mohamed Soliman, Peng Liang et al.
The phenomenon of architecture erosion can negatively impact the maintenance and evolution of software systems, and manifest in a variety of symptoms during software development. While erosion is often considered rather late, its symptoms can act as early warnings to software developers, if detected in time. In addition to static source code analysis, code reviews can be a source of detecting erosion symptoms and subsequently taking action. In this study, we investigate the erosion symptoms discussed in code reviews, as well as their trends, and the actions taken by developers. Specifically, we conducted an empirical study with the two most active Open Source Software (OSS) projects in the OpenStack community (i.e., Nova and Neutron). We manually checked 21,274 code review comments retrieved by keyword search and random selection, and identified 502 code review comments (from 472 discussion threads) that discuss erosion. Our findings show that (1) the proportion of erosion symptoms is rather low, yet notable in code reviews and the most frequently identified erosion symptoms are architectural violation, duplicate functionality, and cyclic dependency; (2) the declining trend of the identified erosion symptoms in the two OSS projects indicates that the architecture tends to stabilize over time; and (3) most code reviews that identify erosion symptoms have a positive impact on removing erosion symptoms, but a few symptoms still remain and are ignored by developers. The results suggest that (1) code review provides a practical way to reduce erosion symptoms; and (2) analyzing the trend of erosion symptoms can help get an insight about the erosion status of software systems, and subsequently avoid the potential risk of architecture erosion.
SEApr 29, 2025
Using LLMs in Generating Design Rationale for Software Architecture DecisionsXiyu Zhou, Ruiyin Li, Peng Liang et al.
Design Rationale (DR) for software architecture decisions refers to the reasoning underlying architectural choices, which provides valuable insights into the different phases of the architecting process throughout software development. However, in practice, DR is often inadequately documented due to a lack of motivation and effort from developers. With the recent advancements in Large Language Models (LLMs), their capabilities in text comprehension, reasoning, and generation may enable the generation and recovery of DR for architecture decisions. In this study, we evaluated the performance of LLMs in generating DR for architecture decisions. First, we collected 50 Stack Overflow (SO) posts, 25 GitHub issues, and 25 GitHub discussions related to architecture decisions to construct a dataset of 100 architecture-related problems. Then, we selected five LLMs to generate DR for the architecture decisions with three prompting strategies, including zero-shot, chain of thought (CoT), and LLM-based agents. With the DR provided by human experts as ground truth, the Precision of LLM-generated DR with the three prompting strategies ranges from 0.267 to 0.278, Recall from 0.627 to 0.715, and F1-score from 0.351 to 0.389. Additionally, 64.45% to 69.42% of the arguments of DR not mentioned by human experts are also helpful, 4.12% to 4.87% of the arguments have uncertain correctness, and 1.59% to 3.24% of the arguments are potentially misleading. To further understand the trustworthiness and applicability of LLM-generated DR in practice, we conducted semi-structured interviews with six practitioners. Based on the experimental and interview results, we discussed the pros and cons of the three prompting strategies, the strengths and limitations of LLM-generated DR, and the implications for the practical use of LLM-generated DR.
SEJul 28, 2025
MAAD: Automate Software Architecture Design through Knowledge-Driven Multi-Agent CollaborationRuiyin Li, Yiran Zhang, Xiyu Zhou et al.
Software architecture design is a critical, yet inherently complex and knowledge-intensive phase of software development. It requires deep domain expertise, development experience, architectural knowledge, careful trade-offs among competing quality attributes, and the ability to adapt to evolving requirements. Traditionally, this process is time-consuming and labor-intensive, and relies heavily on architects, often resulting in limited design alternatives, especially under the pressures of agile development. While Large Language Model (LLM)-based agents have shown promising performance across various SE tasks, their application to architecture design remains relatively scarce and requires more exploration, particularly in light of diverse domain knowledge and complex decision-making. To address the challenges, we proposed MAAD (Multi-Agent Architecture Design), an automated framework that employs a knowledge-driven Multi-Agent System (MAS) for architecture design. MAAD orchestrates four specialized agents (i.e., Analyst, Modeler, Designer and Evaluator) to collaboratively interpret requirements specifications and produce architectural blueprints enriched with quality attributes-based evaluation reports. We then evaluated MAAD through a case study and comparative experiments against MetaGPT, a state-of-the-art MAS baseline. Our results show that MAAD's superiority lies in generating comprehensive architectural components and delivering insightful and structured architecture evaluation reports. Feedback from industrial architects across 11 requirements specifications further reinforces MAAD's practical usability. We finally explored the performance of the MAAD framework with three LLMs (GPT-4o, DeepSeek-R1, and Llama 3.3) and found that GPT-4o exhibits better performance in producing architecture design, emphasizing the importance of LLM selection in MAS-driven architecture design.
SEDec 21, 2021
Understanding Software Architecture Erosion: A Systematic Mapping StudyRuiyin Li, Peng Liang, Mohamed Soliman et al.
Architecture erosion (AEr) can adversely affect software development and has received significant attention in the last decade. However, there is an absence of a comprehensive understanding of the state of research about the reasons and consequences of AEr, and the countermeasures to address AEr. This work aims at systematically investigating, identifying, and analyzing the reasons, consequences, and ways of detecting and handling AEr. With 73 studies included, the main results are as follows: (1) AEr manifests not only through architectural violations and structural issues but also causing problems in software quality and during software evolution; (2) non-technical reasons that cause AEr should receive the same attention as technical reasons, and practitioners should raise awareness of the grave consequences of AEr, thereby taking actions to tackle AEr-related issues; (3) a spectrum of approaches, tools, and measures has been proposed and employed to detect and tackle AEr; and (4) three categories of difficulties and five categories of lessons learned on tackling AEr were identified. The results can provide researchers a comprehensive understanding of AEr and help practitioners handle AEr and improve the sustainability of their architecture. More empirical studies are required to investigate the practices of detecting and addressing AEr in industrial settings.
SEMar 21, 2021
Understanding Architecture Erosion: The Practitioners' PerceptiveRuiyin Li, Peng Liang, Mohamed Soliman et al.
As software systems evolve, their architecture is meant to adapt accordingly by following the changes in requirements, the environment, and the implementation. However, in practice, the evolving system often deviates from the architecture, causing severe consequences to system maintenance and evolution. This phenomenon of architecture erosion has been studied extensively in research, but not yet been examined from the point of view of developers. In this exploratory study, we look into how developers perceive the notion of architecture erosion, its causes and consequences, as well as tools and practices to identify and control architecture erosion. To this end, we searched through several popular online developer communities for collecting data of discussions related to architecture erosion. Besides, we identified developers involved in these discussions and conducted a survey with 10 participants and held interviews with 4 participants. Our findings show that: (1) developers either focus on the structural manifestation of architecture erosion or on its effect on run-time qualities, maintenance and evolution; (2) alongside technical factors, architecture erosion is caused to a large extent by non-technical factors; (3) despite the lack of dedicated tools for detecting architecture erosion, developers usually identify erosion through a number of symptoms; and (4) there are effective measures that can help to alleviate the impact of architecture erosion.