Iulian Neamtiu

CR
3 papers
24 citations
Novelty 38%
AI Score 40

3 Papers

CR · Sep 27, 2024
Artificial-Intelligence Generated Code Considered Harmful: A Road Map for Secure and High-Quality Code Generation

Chun Jie Chong, Zhihao Yao, Iulian Neamtiu

Generating code via an LLM (rather than writing code from scratch) has exploded in popularity. However, the security implications of LLM-generated code are still unknown. We performed a study that compared the security and quality of human-written code with that of LLM-generated code, for a wide range of programming tasks, including data structures, algorithms, cryptographic routines, and LeetCode questions. To assess code security we used unit testing, fuzzing, and static analysis. For code quality, we focused on complexity and size. We found that LLMs can generate incorrect code that fails to implement the required functionality, especially for more complicated tasks; such errors can be subtle. For example, for the cryptographic algorithm SHA1, the LLM generated an incorrect implementation that nevertheless compiles. In cases where its functionality was correct, we found that LLM-generated code is less secure, primarily due to the lack of defensive programming constructs, which invites a host of security issues such as buffer overflows or integer overflows. Fuzzing revealed that LLM-generated code is more prone to hangs and crashes than human-written code. Quality-wise, we found that LLMs generate bare-bones code that lacks defensive programming constructs and is typically more complex (per line of code) than human-written code. Next, we constructed a feedback loop that asked the LLM to re-generate the code and eliminate the issues found (e.g., malloc overflow, array index out of bounds, null dereferences). We found that the LLM fails to eliminate such issues consistently: while it succeeded in some cases, we found instances where the re-generated, supposedly more secure code contains new issues; we also found that, upon prompting, the LLM can introduce issues in files that were issue-free before prompting.
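The feedback loop described in this abstract could be sketched roughly as follows. This is a minimal illustration, not the authors' tooling: `feedback_loop`, the `generate` and `analyze` callables, the prompt wording, and `max_rounds` are all hypothetical names chosen here. The sketch tracks issues that were absent in the previous round but present after re-generation, which is the failure mode the abstract reports.

```python
from typing import Callable, Set, Tuple

# Hypothetical interfaces (assumptions, not from the paper): a generator maps
# a prompt to source code; an analyzer maps source code to issue labels.
Generator = Callable[[str], str]
Analyzer = Callable[[str], Set[str]]


def feedback_loop(task: str, generate: Generator, analyze: Analyzer,
                  max_rounds: int = 3) -> Tuple[str, Set[str], Set[str]]:
    """Re-generate code until no issues are reported or rounds run out.

    Returns the final code, the issues still present in it, and the issues
    that appeared in some round without being present in the round before.
    """
    code = generate(task)
    issues = analyze(code)
    introduced: Set[str] = set()
    for _ in range(max_rounds):
        if not issues:
            break
        # Ask for a new version, listing the issues found in the current one.
        prompt = f"{task}\nFix these issues: {', '.join(sorted(issues))}"
        code = generate(prompt)
        new_issues = analyze(code)
        # Anything reported now but not in the previous round is a regression.
        introduced |= new_issues - issues
        issues = new_issues
    return code, issues, introduced
```

In practice `generate` would wrap a call to an LLM and `analyze` would wrap a static analyzer or fuzzer; both are left abstract here so the loop itself stays testable with stubs.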

64.2 · SE · Apr 25 · Code
Can LLMs be Effective Code Contributors? A Study on Open-source Projects

Chun Jie Chong, Muyeed Ahmed, Zhihao et al.

LLM-generated code is widely used, and the share of committed code produced by LLMs is expected to increase. However, we are not at a point where LLMs can be effective contributors to production code. We present an approach that exposes the shortcomings of LLM code generation on such projects and offers recommendations; the targets of our study are sizable open-source projects, e.g., FFmpeg and wolfSSL. First, we developed a framework that uses verification and validation to evaluate a given LLM's suitability to fix or add features to an existing project. Second, we applied the framework to 212 commits (bug fixes and small feature improvements) in eight popular open-source projects and three LLMs: GPT-4o, Ministral3, and Qwen3-Coder. The success rate varied from 0% to 60% depending on the project. The LLMs failed in a variety of ways, from generating syntactically incorrect code to producing code that fails basic (static) verification or validation via the project's test suite. In particular, the LLMs struggle with generating new code and with handling contexts (function or file) outside a certain size range, and in many cases their success is due to parroting code changes they have been trained on.
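The staged evaluation this abstract describes (syntax, then static verification, then validation against the test suite) might be organized along these lines. This is only a sketch under assumptions: the stage names, the `first_failure` and `success_rate` helpers, and the representation of a patch as a string are hypothetical and are not taken from the authors' framework.

```python
from typing import Callable, List, Optional, Tuple

# A stage is a (name, check) pair; check returns True when the patch passes.
# Real checks would invoke a compiler, a static verifier, and the project's
# test suite; here they are plain callables (an assumption for illustration).
Stage = Tuple[str, Callable[[str], bool]]


def first_failure(patch: str, stages: List[Stage]) -> Optional[str]:
    """Run stages in order; return the name of the first that fails, or None."""
    for name, check in stages:
        if not check(patch):
            return name
    return None


def success_rate(patches: List[str], stages: List[Stage]) -> float:
    """Fraction of patches that pass every stage (0.0 for an empty list)."""
    if not patches:
        return 0.0
    passed = sum(1 for p in patches if first_failure(p, stages) is None)
    return passed / len(patches)
```

Recording the first failing stage per patch, rather than a single pass/fail bit, makes it possible to separate syntactic failures from verification and test-suite failures when summarizing results per project.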

CR · Jun 5, 2020
Knock, Knock. Who's There? On the Security of LG's Knock Codes

Raina Samuel, Philipp Markert, Adam J. Aviv et al.

Knock Codes are a knowledge-based unlock authentication scheme used on LG smartphones where a user enters a code by tapping or "knocking" a sequence on a 2x2 grid. While a lesser-used authentication method compared to PINs or Android patterns, there is likely a large number of Knock Code users; we estimate 700,000 to 2,500,000 in the US alone. In this paper, we studied Knock Code security by asking participants to select codes on mobile devices in three settings: a control treatment, a blocklist treatment, and a treatment with a larger, 2x3 grid. We find that Knock Codes are significantly weaker than other deployed authentication, e.g., PINs or Android patterns. In a simulated attacker setting, 2x3 grids offered no additional security, but blocklisting was more beneficial, making Knock Codes' security similar to that of Android patterns. Participants expressed positive perceptions of Knock Codes, but usability was challenged. SUS values were "marginal" or "ok" across treatments. Based on these findings, we recommend deploying blocklists for selecting a Knock Code because doing so improves security but has limited impact on usability perceptions.