CR · SE · May 29

How to Compare the Security of Code Written by Humans to LLM-generated Code

arXiv:2606.00186 · 31.6 · h-index: 15 · Has Code
Predicted impact: top 58% in CR · last 90 days · Originality: Synthesis-oriented
AI Analysis

For researchers and practitioners evaluating the security implications of LLM-generated code, this work provides a reproducible framework for empirical comparisons.

The paper addresses the lack of a standardized method for comparing the security of code written by humans versus LLMs. It proposes an automated framework for conducting comparative studies and validates it through a feasibility study, providing an experimental blueprint for species-fair comparisons.

Large language models (LLMs) are rapidly transforming how software is created and maintained. Comparing LLM-generated code against human-written standards is essential to determine whether these new tools uphold or erode the security baselines established by professional developers. Yet we lack a standardized method for empirically comparing the security of code produced through human-LLM collaboration against code produced by LLM-only or traditional human-only methods. To facilitate this, we propose an automated framework for conducting comparative studies across human-only, LLM-only, and hybrid conditions. Our approach automates the logging of prompts, timing, and experimental settings, and measures outcomes through multi-dimensional static and dynamic quality analysis. We provide an open-source implementation of this framework so that future researchers can conduct reproducible, species-fair experiments. Importantly, we validate the framework via a feasibility study, providing an experimental blueprint for "species-fair" comparisons between human and AI subjects. By sharing lessons learned, we establish a foundation for empirical research on the security of human-written and LLM-generated code.
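The abstract describes automated logging of prompts, timing, and experimental settings across three conditions. As a rough illustration only, a per-session record of that kind might look like the sketch below. This is not the authors' implementation: the names `Condition`, `SessionLog`, `log_prompt`, and the `settings` keys are all hypothetical, and the real framework's schema may differ.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from enum import Enum
from typing import Optional


class Condition(str, Enum):
    """The three study conditions named in the abstract."""
    HUMAN_ONLY = "human-only"
    LLM_ONLY = "llm-only"
    HYBRID = "hybrid"


@dataclass
class SessionLog:
    """Hypothetical record for one participant (human or LLM) on one task."""
    participant_id: str
    condition: Condition
    settings: dict                       # experimental settings (assumed free-form)
    started_at: float = field(default_factory=time.time)
    finished_at: Optional[float] = None
    prompts: list = field(default_factory=list)

    def log_prompt(self, prompt: str, response: str) -> None:
        """Store a prompt/response pair with its offset from session start."""
        if self.condition is Condition.HUMAN_ONLY:
            # Assumption: human-only sessions must not involve an LLM at all.
            raise ValueError("human-only sessions cannot contain prompts")
        self.prompts.append({
            "offset_s": time.time() - self.started_at,
            "prompt": prompt,
            "response": response,
        })

    def finish(self) -> None:
        """Mark the end of the session so task duration can be derived."""
        self.finished_at = time.time()

    def to_json(self) -> str:
        """Serialize the record for later analysis."""
        record = asdict(self)
        record["condition"] = self.condition.value
        return json.dumps(record, sort_keys=True)


if __name__ == "__main__":
    log = SessionLog("p01", Condition.HYBRID, {"model": "example-model"})
    log.log_prompt("Write a login handler", "def login(): ...")
    log.finish()
    print(log.to_json())
```

Keeping one record shape for all three conditions is one way to make later comparisons uniform: the same fields exist whether the subject is a human, an LLM, or both, and only the `prompts` list is empty in the human-only case.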

