Keno Hassler

CR | Feb 8, 2023 | Code

CodeLMSec Benchmark: Systematically Evaluating and Finding Security Vulnerabilities in Black-Box Code Language Models

Hossein Hajipour, Keno Hassler, Thorsten Holz et al.

Large language models (LLMs) for automatic code generation have achieved breakthroughs in several programming tasks. Their advances in competition-level programming problems have made them an essential pillar of AI-assisted pair programming, and tools such as GitHub Copilot have emerged as part of the daily programming workflow used by millions of developers. The training data for these models is usually collected from the Internet (e.g., from open-source repositories) and is likely to contain faults and security vulnerabilities. This unsanitized training data can cause the language models to learn these vulnerabilities and propagate them during the code generation procedure. While these models have been extensively assessed for their ability to produce functionally correct programs, there remains a lack of comprehensive investigations and benchmarks addressing the security aspects of these models. In this work, we propose a method to systematically study the security issues of code language models to assess their susceptibility to generating vulnerable code. To this end, we introduce the first approach to automatically find generated code that contains vulnerabilities in black-box code generation models. To achieve this, we present an approach to approximate inversion of the black-box code generation models based on few-shot prompting. We evaluate the effectiveness of our approach by examining code language models in generating high-risk security weaknesses. Furthermore, we establish a collection of diverse non-secure prompts for various vulnerability scenarios using our method. This dataset forms a benchmark for evaluating and comparing the security weaknesses in code language models.

23.3 | CR | Mar 27 | Code

A Comparative Study of Fuzzers and Static Analysis Tools for Finding Memory Unsafety in C and C++

Keno Hassler, Philipp Görz, Stephan Lipp

Over 70% of security vulnerabilities in critical software systems today result from memory safety violations. To address this challenge, fuzzing and static analysis are widely used automated methods to discover such vulnerabilities. Fuzzing generates random program inputs to identify faults at runtime, while static analysis reasons about the code to detect potential vulnerabilities. Although these techniques share a common goal, they take fundamentally different approaches and have evolved largely independently. In this paper, we present an empirical analysis of five static analyzers and 13 fuzzers, applied to over 100 known security vulnerabilities in C/C++ programs. We measure the detection rate for each tool and vulnerability to evaluate how the approaches differ and complement each other. We find that fuzzers discover a very similar set of bugs, while static analyzers report more diverse sets, and identify clear leaders for each group. Comparing the union of all fuzzers with that of all static analyzers, we observe they are nearly disjoint. In a second step, we manually validate the report-to-bug mapping we developed for the evaluation and discuss more qualitative aspects of limitations, usability, and integration into the development process. We examine how widely these bug finding tools are used in critical open-source projects. We advise developers on choosing tools to harden their software and identify barriers to adoption as well as future research opportunities.
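The comparison described in this abstract (per-tool detection rate, similarity within each tool group, and overlap between the union of all fuzzers and the union of all static analyzers) can be illustrated with plain set arithmetic. The sketch below is not the authors' evaluation code: the tool names, bug IDs, and the choice of Jaccard similarity as the overlap measure are all assumptions made for illustration.

```python
from itertools import combinations

# Hypothetical results: tool name -> set of bug IDs it detected.
# These names and IDs are made up; they are not data from the paper.
fuzzers = {
    "fuzzer_a": {"bug1", "bug2", "bug3"},
    "fuzzer_b": {"bug1", "bug2", "bug4"},
}
static_analyzers = {
    "analyzer_x": {"bug5", "bug6"},
    "analyzer_y": {"bug3", "bug7"},
}
ALL_BUGS = {f"bug{i}" for i in range(1, 11)}  # assumed ground truth of 10 bugs


def detection_rate(found, all_bugs):
    """Fraction of the known bugs that a tool detected."""
    return len(found & all_bugs) / len(all_bugs)


def jaccard(a, b):
    """Jaccard similarity of two sets: 1.0 = identical, 0.0 = disjoint."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def group_union(results):
    """All bugs found by at least one tool in the group."""
    return set().union(*results.values())


def mean_pairwise_jaccard(results):
    """Average similarity between every pair of tools in a group."""
    pairs = list(combinations(results.values(), 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    for name, found in {**fuzzers, **static_analyzers}.items():
        print(name, detection_rate(found, ALL_BUGS))
    print("fuzzer similarity:", mean_pairwise_jaccard(fuzzers))
    print("analyzer similarity:", mean_pairwise_jaccard(static_analyzers))
    print("group overlap:",
          jaccard(group_union(fuzzers), group_union(static_analyzers)))
```

In this toy data, a high within-group similarity for the fuzzers and a low overlap between the two group unions would correspond to the pattern the abstract reports: fuzzers finding similar bugs, and the two approaches being nearly disjoint.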

2 Papers