92.6SEMay 31
Bridging Requirements and Architecture: Multi-Agent Orchestration with External Knowledge and Hierarchical MemoryRuiyin Li, Yiran Zhang, Xiyu Zhou et al.
Software architecture design is a critical yet inherently complex and knowledge-intensive phase that requires balancing competing quality attributes and adapting to evolving requirements. Traditionally, this process has been time-consuming, labor-intensive, and heavily reliant on architects, often resulting in limited exploration of alternative architectural decompositions and styles, especially under the pressures of agile development. While LLM-based agents have shown promising performance across various software engineering tasks, their application to architecture design remains relatively scarce and requires systematic exploration. To address these challenges, we proposed MAAD (Multi-Agent Architecture Design), a knowledge-driven framework that orchestrates four specialized agents (i.e., Analyst, Modeler, Designer and Evaluator) to autonomously and collaboratively transform requirements specifications into comprehensive, multi-view architectural blueprints with quality attribute assessments. MAAD incorporates RAG to inject recognized architectural standards and patterns into the workflow and leverages a hierarchical memory mechanism that captures design history for iterative refinement. We evaluated MAAD through comparative experiments against MetaGPT, using quantitative architecture-level metrics across 10 case studies and qualitative feedback from industry architects on 10 real-world specifications. Results show that MAAD generates more complete, modular, and traceable architectures than the baseline, and its dedicated Evaluator agent autonomously produces structured quality evaluation reports that significantly reduce manual validation efforts. Furthermore, we found that the quality of the generated architecture heavily depends on the underlying LLM's reasoning capacity, with GPT-5.2 and Qwen3.5 outperforming other models across most evaluation settings.
75.7SEMar 27
A Comprehensive Evaluation of Parameter-Efficient Fine-Tuning on Code Smell DetectionBeiqi Zhang, Peng Liang, Xin Zhou et al.
Automated code smell detection faces persistent challenges due to the subjectivity of heuristic rules and the limited performance of traditional ML/DL models. While Large Language Models (LLMs) offer a promising alternative, their adoption is impeded by high fine-tuning costs and a lack of "LM-ready" benchmarks. To bridge these gaps, we present a study with two synergistic contributions. First, we constructed a high-quality benchmark for Complex Conditional, Complex Method, Feature Envy, and Data Class, validated through a rigorous two-stage manual review. Second, leveraging this benchmark, we systematically evaluated four Parameter-Efficient Fine-Tuning (PEFT) methods across nine LMs of varying parameter sizes. Their performance is compared against a comprehensive suite of baselines, including heuristics-based detectors, Deep Learning (DL)-based approaches, and state-of-the-art general-purpose LLMs under multiple In-Context Learning (ICL) settings. Our results demonstrate that PEFT methods achieve effectiveness comparable to or surpassing full fine-tuning while substantially reducing peak GPU memory usage for code smell detection. Furthermore, PEFT-tuned LMs consistently outperform all baselines, yielding MCC improvements ranging from 0.33% to 13.69%, with particularly notable gains for specific smell categories. These findings highlight PEFT techniques as effective and scalable solutions for advancing code smell detection.
SEApr 29, 2025
Using LLMs in Generating Design Rationale for Software Architecture DecisionsXiyu Zhou, Ruiyin Li, Peng Liang et al.
Design Rationale (DR) for software architecture decisions refers to the reasoning underlying architectural choices, which provides valuable insights into the different phases of the architecting process throughout software development. However, in practice, DR is often inadequately documented due to a lack of motivation and effort from developers. With the recent advancements in Large Language Models (LLMs), their capabilities in text comprehension, reasoning, and generation may enable the generation and recovery of DR for architecture decisions. In this study, we evaluated the performance of LLMs in generating DR for architecture decisions. First, we collected 50 Stack Overflow (SO) posts, 25 GitHub issues, and 25 GitHub discussions related to architecture decisions to construct a dataset of 100 architecture-related problems. Then, we selected five LLMs to generate DR for the architecture decisions with three prompting strategies, including zero-shot, chain of thought (CoT), and LLM-based agents. With the DR provided by human experts as ground truth, the Precision of LLM-generated DR with the three prompting strategies ranges from 0.267 to 0.278, Recall from 0.627 to 0.715, and F1-score from 0.351 to 0.389. Additionally, 64.45% to 69.42% of the arguments of DR not mentioned by human experts are also helpful, 4.12% to 4.87% of the arguments have uncertain correctness, and 1.59% to 3.24% of the arguments are potentially misleading. To further understand the trustworthiness and applicability of LLM-generated DR in practice, we conducted semi-structured interviews with six practitioners. Based on the experimental and interview results, we discussed the pros and cons of the three prompting strategies, the strengths and limitations of LLM-generated DR, and the implications for the practical use of LLM-generated DR.
SEJul 28, 2025
MAAD: Automate Software Architecture Design through Knowledge-Driven Multi-Agent CollaborationRuiyin Li, Yiran Zhang, Xiyu Zhou et al.
Software architecture design is a critical, yet inherently complex and knowledge-intensive phase of software development. It requires deep domain expertise, development experience, architectural knowledge, careful trade-offs among competing quality attributes, and the ability to adapt to evolving requirements. Traditionally, this process is time-consuming and labor-intensive, and relies heavily on architects, often resulting in limited design alternatives, especially under the pressures of agile development. While Large Language Model (LLM)-based agents have shown promising performance across various SE tasks, their application to architecture design remains relatively scarce and requires more exploration, particularly in light of diverse domain knowledge and complex decision-making. To address the challenges, we proposed MAAD (Multi-Agent Architecture Design), an automated framework that employs a knowledge-driven Multi-Agent System (MAS) for architecture design. MAAD orchestrates four specialized agents (i.e., Analyst, Modeler, Designer and Evaluator) to collaboratively interpret requirements specifications and produce architectural blueprints enriched with quality attributes-based evaluation reports. We then evaluated MAAD through a case study and comparative experiments against MetaGPT, a state-of-the-art MAS baseline. Our results show that MAAD's superiority lies in generating comprehensive architectural components and delivering insightful and structured architecture evaluation reports. Feedback from industrial architects across 11 requirements specifications further reinforces MAAD's practical usability. We finally explored the performance of the MAAD framework with three LLMs (GPT-4o, DeepSeek-R1, and Llama 3.3) and found that GPT-4o exhibits better performance in producing architecture design, emphasizing the importance of LLM selection in MAS-driven architecture design.