CR · May 11

Towards LLM-Based Analysis of Virtualization-Obfuscated Code through Automated Data Generation

Sangjun An, Hyeyeon Park, Yejin Son, Seoksu Lee, Eun-Sun Cho

arXiv:2605.09961 · 4.6

Predicted impact: top 92% in CR · last 90 days
Originality: Incremental advance

AI Analysis

This work addresses the need for scalable analysis of virtualization-obfuscated code, a problem for security analysts dealing with malware or software protection.

The paper tackles the challenge of analyzing virtualization-obfuscated binaries with LLMs by decomposing them into structural units and using a static analysis framework to automatically generate labeled data. The prototype achieves strong performance on real-world obfuscators.

Virtualization-based obfuscation produces extremely large and structurally complex binaries, posing challenges for LLM-based analysis due to input size limits and the need for large-scale labeled data. We address this by focusing on structural rather than full semantic analysis. Obfuscated binaries are decomposed into the largest semantically coherent units that fit within LLM constraints and are labeled according to their structural roles. We implement a static analysis framework to automate labeling and enable large-scale dataset generation. Our prototype shows strong performance on real-world virtualization obfuscators.
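The decomposition step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Region` type, the role names ("dispatcher", "handler", and so on), the token counts, and the budget are all hypothetical, and the sketch assumes the binary has already been lifted into a tree of nested regions in which only leaves carry code. The idea is to walk the tree top-down and keep the first (that is, largest) region on each path that fits within the token budget.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Region:
    """Hypothetical node in a structural hierarchy of an obfuscated binary."""
    name: str
    role: str                      # assumed structural-role label
    tokens: int = 0                # token cost of a leaf's code; 0 for inner nodes
    children: list[Region] = field(default_factory=list)

    def total_tokens(self) -> int:
        # Cost of this region including everything nested inside it.
        return self.tokens + sum(c.total_tokens() for c in self.children)


def largest_fitting_units(region: Region, budget: int) -> list[Region]:
    """Return the largest regions whose total size fits within the budget."""
    if region.total_tokens() <= budget:
        return [region]            # whole region fits: keep it as one unit
    if not region.children:
        return [region]            # oversize leaf: returned as-is, caller must handle it
    units: list[Region] = []
    for child in region.children:  # too big: descend into sub-regions
        units.extend(largest_fitting_units(child, budget))
    return units


# Toy hierarchy with made-up sizes.
vm = Region("vm", "interpreter", children=[
    Region("dispatcher", "dispatcher", tokens=300),
    Region("handlers", "handler_table", children=[
        Region("h_add", "handler", tokens=500),
        Region("h_xor", "handler", tokens=700),
    ]),
])

for unit in largest_fitting_units(vm, budget=1000):
    print(unit.name, unit.role, unit.total_tokens())
```

With a larger budget the same call returns coarser units (the whole handler table, or the whole interpreter), which is the sense in which each unit is the largest one that fits. Each returned unit carries its role label, so the output could serve directly as labeled samples in this toy setting.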
