SE, AI, LG | Mar 4, 2024

Can LLMs Generate Architectural Design Decisions? - An Exploratory Empirical Study

arXiv:2403.01709v1 | 49 citations | h-index: 8 | ICSA
Originality: Incremental advance
AI Analysis

This study addresses the slow adoption of Architecture Decision Records in software development by exploring whether their generation can be automated. It is an incremental step because it only establishes feasibility and does not reach human-level performance.

The study explored whether large language models (LLMs) can generate the Decision of an Architecture Decision Record (ADR) from its Context. It found that GPT-4 produces relevant and accurate decisions in a zero-shot setting, though it falls short of human performance. More cost-effective models reach similar results: GPT-3.5 in a few-shot setting and the smaller Flan-T5 after fine-tuning.

Architectural Knowledge Management (AKM) involves the organized handling of information related to architectural decisions and design within a project or organization. An essential artifact of AKM is the Architecture Decision Record (ADR), which documents a key design decision. ADRs capture the decision context, the decision made, and various other aspects of a design decision, thereby promoting transparency, collaboration, and understanding. Despite their benefits, ADR adoption in software development has been slow due to challenges like time constraints and inconsistent uptake. Recent advancements in Large Language Models (LLMs) may help bridge this adoption gap by facilitating ADR generation. However, the effectiveness of LLMs for ADR generation or understanding has not been explored. To this end, we perform an exploratory study that investigates the feasibility of using LLMs to generate ADRs given the decision context. We use GPT and T5-based models with 0-shot, few-shot, and fine-tuning approaches to generate the Decision of an ADR given its Context. Our results indicate that in a 0-shot setting, state-of-the-art models such as GPT-4 generate relevant and accurate Design Decisions, although they fall short of human-level performance. Additionally, we observe that more cost-effective models like GPT-3.5 can achieve similar outcomes in a few-shot setting, and smaller models such as Flan-T5 can yield comparable results after fine-tuning. To conclude, this exploratory study suggests that LLMs can generate Design Decisions, but further research is required to attain human-level generation and to establish standardized, widespread adoption.
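The difference between the 0-shot and few-shot settings mentioned in the abstract can be illustrated with a minimal Python sketch. It only assembles prompt text and does not call any model. The `ADR` class, the `INSTRUCTION` wording, and the `build_prompt` helper are hypothetical names introduced here for illustration; the paper's actual prompts are not given in this summary.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ADR:
    """A minimal ADR holding only the two fields the study works with."""
    context: str
    decision: str = ""  # left empty when the Decision is what should be generated


# Hypothetical instruction text (an assumption, not the paper's prompt).
INSTRUCTION = (
    "You are a software architect. Read the ADR context below "
    "and state the design decision."
)


def build_prompt(target: ADR, examples: Optional[List[ADR]] = None) -> str:
    """Return a 0-shot prompt (no examples) or a few-shot prompt (with examples)."""
    parts = [INSTRUCTION]
    # Few-shot: each example is a complete Context/Decision pair.
    for ex in examples or []:
        parts.append(f"Context: {ex.context}\nDecision: {ex.decision}")
    # The target ends with an open "Decision:" for the model to complete.
    parts.append(f"Context: {target.context}\nDecision:")
    return "\n\n".join(parts)


target = ADR(context="Services need to exchange events asynchronously.")
zero_shot = build_prompt(target)
few_shot = build_prompt(
    target,
    examples=[
        ADR("We need a persistent store for user data.", "Use PostgreSQL."),
        ADR("The UI must work offline.", "Adopt a local-first cache."),
    ],
)
```

Either string could then be sent to a chat or completion model of choice. Fine-tuning, the third setting in the abstract, would instead use the same Context/Decision pairs as training data and is not shown here.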

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
