Exploring the Limits of ChatGPT in Software Security Applications
This work helps researchers and practitioners understand LLM capabilities in system security, but it is incremental, as it applies existing methods to a new domain.
The paper explores the limits of ChatGPT in software security applications such as vulnerability detection and decompilation. It finds that ChatGPT excels at code generation and reasoning but has limitations, such as difficulty processing long code contexts, and that GPT-4 shows significant improvements over GPT-3.5 in most tasks.
Large language models (LLMs) have evolved rapidly and achieved remarkable results in recent years. OpenAI's ChatGPT, backed by GPT-3.5 or GPT-4, has gained instant popularity due to its strong capability across a wide range of tasks, including natural language tasks, coding, mathematics, and engaging conversations. However, the impacts and limits of such LLMs in the system security domain are less explored. In this paper, we delve into the limits of LLMs (i.e., ChatGPT) in seven software security applications including vulnerability detection/repair, debugging, debloating, decompilation, patching, root cause analysis, symbolic execution, and fuzzing. Our exploration reveals that ChatGPT not only excels at generating code, which is the conventional application of language models, but also demonstrates strong capability in understanding user-provided commands in natural language, reasoning about control and data flows within programs, generating complex data structures, and even decompiling assembly code. Notably, GPT-4 shows significant improvements over GPT-3.5 in most security tasks. We also identify certain limitations of ChatGPT in security-related tasks, such as its constrained ability to process long code contexts.