Insu Yun

CR
h-index: 12
4 papers
54 citations
Novelty: 60%
AI Score: 52

4 Papers

89.8 · LG · May 12 · Code
CTFusion: A CTF-based Benchmark for LLM Agent Evaluation

Dongjun Lee, Ga-eun Bae, Insu Yun

Recent advances in Large Language Models (LLMs) have enabled agentic systems for complex, multi-step tasks; cybersecurity is emerging as a prominent application. To evaluate such agents, researchers widely adopt Capture The Flag (CTF) benchmarks. However, current CTF benchmarks reuse existing challenges, which exposes them to data contamination and potential cheating. Notably, we confirmed these issues in practice by integrating web search tools into an existing agent. To address these limitations, we present CTFusion, a streaming evaluation framework built on Live CTFs. To achieve this, CTFusion preserves per-agent independence under a single team account and reduces competition impact by forwarding only the first correct flag per challenge. Moreover, we implement CTFusion as a Model Context Protocol (MCP) server on the widely used CTFd platform, which offers broad applicability to diverse CTF events and agent types. Through experiments with three LLMs, two agents, and five Live CTFs, we demonstrate that existing CTF benchmarks can be unreliable in assessing LLM-based agents, while CTFusion can serve as a robust solution for evaluating cybersecurity agents. We release CTFusion as open source to foster future research in this area.
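One way to picture the "forward only the first correct flag per challenge" idea is a small gate that sits between several agents and the shared team account. The sketch below is illustrative only and is not CTFusion's code: the `FlagGate` class, the `submit_upstream` callback, and the assumption that each challenge has a single static flag are all hypothetical. It forwards attempts until the platform accepts one, then judges later submissions locally so the scoreboard sees a single solve while each agent still gets its own result.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class FlagGate:
    """Hypothetical gate between several agents and one shared team account."""

    # submit_upstream(challenge, flag) -> True if the platform accepts the flag.
    submit_upstream: Callable[[str, str], bool]
    # challenge -> flag the platform accepted (assumes one static flag per challenge)
    solved: Dict[str, str] = field(default_factory=dict)
    # (agent, challenge, correct) for every attempt, kept per agent for evaluation
    log: List[Tuple[str, str, bool]] = field(default_factory=list)

    def submit(self, agent: str, challenge: str, flag: str) -> bool:
        if challenge in self.solved:
            # The team already solved this challenge: judge locally and do not
            # contact the platform again.
            correct = flag == self.solved[challenge]
        else:
            correct = self.submit_upstream(challenge, flag)
            if correct:
                self.solved[challenge] = flag
        self.log.append((agent, challenge, correct))
        return correct
```

In this reading, per-agent independence comes from the log rather than from the platform: each agent is credited for its own correct submission even when another agent's flag was the one actually forwarded.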

79.8 · CR · May 11
Agentic Fuzzing: Opportunities and Challenges

Junyoung Park, Insu Yun

Fuzzers and static analyzers find many bugs but struggle with logic bugs in mature codebases. Triggering such a bug often requires multi-step reasoning that produces no distinctive execution feedback, and variants can appear across implementations too different for a single pattern to match. Recent LLM-assisted approaches help, but they use LLMs as auxiliaries rather than as the reasoning engine. We propose agentic fuzzing, a bug-finding approach seeded by historical bugs in which deep agents perform the reasoning directly. Given a reference bug, the agent analyzes its root cause, hypothesizes new scenarios elsewhere in the codebase that may share that cause, and verifies each hypothesis by generating and running proof-of-concept code. This lets the agent find variants that differ completely in trigger path or code structure from the reference. We identify three practical challenges in implementing agentic fuzzing: harness engineering, redundant investigations across seeds with similar root causes, and scheduling seeds in a large corpus. We address these in AFuzz through a four-stage agent pipeline, scenario coverage that deduplicates previously explored scenarios, and a DPP-MAP scheduler that orders seeds by diversity. We ran AFuzz on the V8 JavaScript engine for about one month, finding 40 bugs (including three duplicates), which earned a total of $35,000 in bounties and two CVE assignments. AFuzz also found 19 bugs (including one duplicate) in SpiderMonkey and JavaScriptCore using the seeds from V8. Agentic fuzzing is still in its early stages, and several open problems remain, which we discuss in the paper. Still, we think it points to a promising direction for finding logic bugs.
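The abstract only names a "DPP-MAP scheduler that orders seeds by diversity", so the following is a sketch of the general idea and not AFuzz's implementation. It assumes a hypothetical similarity kernel over seeds (how such a kernel would be built from bug reports is not stated in the source) and uses the common greedy approximation to MAP inference for a determinantal point process: at each step, pick the seed that maximizes the determinant of the kernel restricted to the seeds chosen so far.

```python
from typing import List, Sequence


def det(m: List[List[float]]) -> float:
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    n = len(a)
    result = 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[p][c]) < 1e-12:
            return 0.0
        if p != c:
            a[c], a[p] = a[p], a[c]
            result = -result
        result *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return result


def greedy_dpp_order(kernel: Sequence[Sequence[float]]) -> List[int]:
    """Order all items greedily; each pick maximizes det of the chosen submatrix."""
    remaining = list(range(len(kernel)))
    chosen: List[int] = []
    while remaining:
        def score(i: int) -> float:
            idx = chosen + [i]
            return det([[kernel[r][c] for c in idx] for r in idx])

        best = max(remaining, key=score)  # ties go to the lowest index
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Hypothetical kernel: seeds 0 and 1 are near-duplicates, seed 2 is unrelated.
KERNEL = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
ORDER = greedy_dpp_order(KERNEL)
```

With this kernel the unrelated seed is scheduled before the near-duplicate, which is the behavior a diversity-first ordering is meant to produce. Recomputing determinants from scratch is cubic per candidate; it keeps the sketch short but would not be the choice for a large corpus.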

CR · Sep 18, 2025
ATLANTIS: AI-driven Threat Localization, Analysis, and Triage Intelligence System

Taesoo Kim, HyungSeok Han, Soyeon Park et al.

We present ATLANTIS, the cyber reasoning system developed by Team Atlanta that won 1st place in the Final Competition of DARPA's AI Cyber Challenge (AIxCC) at DEF CON 33 (August 2025). AIxCC (2023-2025) challenged teams to build autonomous cyber reasoning systems capable of discovering and patching vulnerabilities at the speed and scale of modern software. ATLANTIS integrates large language models (LLMs) with program analysis -- combining symbolic execution, directed fuzzing, and static analysis -- to address limitations in automated vulnerability discovery and program repair. Developed by researchers at Georgia Institute of Technology, Samsung Research, KAIST, and POSTECH, the system addresses core challenges: scaling across diverse codebases from C to Java, achieving high precision while maintaining broad coverage, and producing semantically correct patches that preserve intended behavior. We detail the design philosophy, architectural decisions, and implementation strategies behind ATLANTIS, share lessons learned from pushing the boundaries of automated security when program analysis meets modern AI, and release artifacts to support reproducibility and future research.

CR · Mar 1, 2019
Automatic Techniques to Systematically Discover New Heap Exploitation Primitives

Insu Yun, Dhaval Kapil, Taesoo Kim

Heap exploitation techniques that abuse allocator metadata have been widely studied because they are application independent and can be used in restricted environments where only metadata can be corrupted. Although prior work has found several interesting exploitation techniques, they were discovered in an ad hoc, manual fashion, which cannot effectively handle allocator changes or the variety of allocators. In this paper, we present a new naming scheme for heap exploitation techniques that systematically organizes them and exposes the unexplored space of techniques, along with ArcHeap, a tool that finds heap exploitation techniques automatically and systematically regardless of the underlying allocator implementation. To do so, ArcHeap uses fuzzing to generate a set of heap actions (e.g., allocation or deallocation), exploiting the common designs of modern heap allocators. ArcHeap then checks whether the actions produce an impact of exploitation, such as arbitrary write or overlapped chunks, which efficiently indicates whether the actions can be converted into an exploitation technique. Finally, from these actions, ArcHeap automatically generates Proof-of-Concept code for the exploitation technique. We evaluated ArcHeap with real-world allocators (ptmalloc, jemalloc, and tcmalloc) and custom allocators from the DARPA Cyber Grand Challenge. ArcHeap successfully found 14 out of 16 known exploitation techniques and found five new exploitation techniques in ptmalloc. Moreover, ArcHeap found several exploitation techniques for jemalloc, tcmalloc, and even the custom allocators. Further, using differential testing, ArcHeap can automatically show how exploitation techniques change across ptmalloc versions.
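The "overlapped chunks" impact mentioned above can be illustrated with a small check over the set of live allocations. This is a minimal sketch and not ArcHeap's code: the `(start, size)` chunk representation and the `find_overlaps` helper are assumptions made for illustration. The idea is that two live chunks returned by a correct allocator should never share bytes, so any overlap after a sequence of heap actions signals a potential exploitation primitive.

```python
from typing import List, Tuple

Chunk = Tuple[int, int]  # (start address, size in bytes); hypothetical representation


def find_overlaps(live: List[Chunk]) -> List[Tuple[Chunk, Chunk]]:
    """Return every pair of live chunks whose byte ranges intersect."""
    ordered = sorted(live)
    overlaps: List[Tuple[Chunk, Chunk]] = []
    for i, (start, size) in enumerate(ordered):
        end = start + size  # first byte past this chunk
        for other in ordered[i + 1:]:
            if other[0] >= end:
                # Chunks are sorted by start, so nothing later can overlap either.
                break
            overlaps.append(((start, size), other))
    return overlaps
```

For example, `find_overlaps([(0x1000, 0x20), (0x1010, 0x20), (0x2000, 0x10)])` reports the first two chunks as overlapping, while adjacent chunks such as `(0, 16)` and `(16, 16)` are not reported.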