SE AIFeb 11

PELLI: Framework to effectively integrate LLMs for quality software generation

arXiv:2602.10808v1

Originality Incremental advance

AI Analysis

This provides a practical guide for developers to integrate LLMs into software generation while adhering to quality standards, though it is incremental as it builds on existing evaluation methods.

The paper tackles the problem of inconsistent code quality from LLMs by proposing PELLI, a framework for iterative analysis and comprehensive evaluation, finding that GPT-4T and Gemini performed slightly better based on metrics for maintainability, performance, and reliability.

Recent studies have revealed that when LLMs are appropriately prompted and configured, they demonstrate mixed results. Such results often meet or exceed the baseline performance. However, these comparisons have two primary issues. First, they mostly considered only reliability as a comparison metric and selected a few LLMs (such as Codex and ChatGPT) for comparision. This paper proposes a comprehensive code quality assessment framework called Programmatic Excellence via LLM Iteration (PELLI). PELLI is an iterative analysis-based process that upholds high-quality code changes. We extended the state-of-the-art by performing a comprehensive evaluation that generates quantitative metrics for analyzing three primary nonfunctional requirements (such as maintainability, performance, and reliability) while selecting five popular LLMs. For PELLI's applicability, we selected three application domains while following Python coding standards. Following this framework, practitioners can ensure harmonious integration between LLMs and human developers, ensuring that their potential is fully realized. PELLI can serve as a practical guide for developers aiming to leverage LLMs while adhering to recognized quality standards. This study's outcomes are crucial for advancing LLM technologies in real-world applications, providing stakeholders with a clear understanding of where these LLMs excel and where they require further refinement. Overall, based on three nonfunctional requirements, we have found that GPT-4T and Gemini performed slightly better. We also found that prompt design can influence the overall code quality. In addition, each application domain demonstrated high and low scores across various metrics, and even within the same metrics across different prompts.

View on arXiv PDF

Similar