SEMar 30

Toward Functional and Non-Functional Evaluation of Application-Level Code Generation

arXiv:2602.0346267.7h-index: 5
AI Analysis

This addresses the need for better evaluation of real-world software generation, though it is incremental as it builds on existing benchmarks.

The paper tackles the problem of evaluating application-level code generation by LLMs, introducing RAL-Bench to assess functional correctness and non-functional quality, with results showing no model exceeds a 45% functional score.

Large language models (LLMs) have achieved strong performance on code generation. However, most prior evaluations focus on snippet-level outputs, such as function generation or repository completion. These settings do not fully evaluate application-level code generation, where the goal is to produce a runnable repository with coherent multi-file structure, dependency support, and end-to-end executability. In addition, real-world software quality depends not only on functional correctness but also on non-functional quality attributes, such as maintainability and security. In this paper, we present RAL-Bench, a benchmark and evaluation framework for application-level code generation. For each task, RAL-Bench derives a concise natural-language requirement from a high-quality reference project, constructs black-box system tests for both functional correctness and non-functional quality attributes. It also retains only the candidate tests that pass on the reference repository. Under this unified evaluation protocol, functional correctness is measured by the system test pass rate, while non-functional quality is evaluated along five ISO/IEC 25010-inspired dimensions, with per-dimension diagnostics and reference-normalized scoring.We evaluate 16 frontier LLMs under a controlled zero-shot setting with greedy decoding. The results show that functional correctness remains the primary bottleneck in application-level code generation, while non-functional quality also remains challenging. Under our evaluation protocol, no model exceeds a 45\% functional score. These findings suggest that strong performance on existing code generation benchmarks does not yet translate to strong performance on application-level repository generation. This result highlights the need for evaluation settings that directly assess end-to-end repository generation rather than relying only on snippet-level success.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes