LG CL PL SE

May 30, 2025

SwiftEval: Developing a Language-Specific Benchmark for LLM-generated Code Evaluation

Ivan Petrukha, Yana Kurliak, Nataliia Stulova

arXiv:2505.24324v1

7.12 citations

h-index: 2

2025 IEEE/ACM Second International Conference on AI Foundation Models and Software Engineering (Forge)

Originality: Synthesis-oriented

AI Analysis

This work addresses a domain-specific problem for developers and researchers evaluating LLMs on Swift, but it is incremental, as it builds on existing multilingual benchmark approaches.

The authors tackled the lack of high-quality evaluation benchmarks for LLM-generated Swift code by creating SwiftEval, a hand-crafted benchmark of 28 problems, and found that LLM scores drop significantly on problems requiring language-specific features, especially for smaller models.

In recent years, large language models (LLMs) have showcased significant advancements in code generation. However, most evaluation benchmarks are primarily oriented towards Python, making it difficult to evaluate LLMs rigorously on other programming languages, such as Swift. By examining widely established multilingual benchmarks like HumanEval-XL and MultiPL-E, we identified critical issues specific to their Swift components, making them insufficient or even irrelevant for assessing LLM coding capabilities on Swift. Unlike these existing approaches, which prioritize rapid scaling and generalization by automatically translating Python-centric benchmarks with LLMs, we adopt a quality-over-quantity methodology. We present SwiftEval, the first Swift-oriented benchmark consisting of 28 carefully hand-crafted problems, and evaluate 44 popular Code LLMs on it. Our results show a significant drop in LLM scores for problems requiring language-specific features, most noticeable in smaller models.
