PL | CL | Mar 8, 2024

LLM4Decompile: Decompiling Binary Code with Large Language Models

arXiv:2403.05286v3 | 83 citations | h-index: 5 | Has Code | EMNLP
Originality: Highly original
AI Analysis

This work addresses the challenge of producing readable and executable decompiled code for software analysis and security, representing a significant advancement over traditional tools.

The paper tackles the problem of decompiling binary code to high-level source code by proposing LLM4Decompile, a series of open-source large language models. The models outperform GPT-4o and Ghidra by over 100% in re-executability rate on the HumanEval and ExeBench benchmarks, and the refinement variants, which refine Ghidra's decompiled output, achieve a further 16.2% improvement.

Decompilation aims to convert binary code to high-level source code, but traditional tools like Ghidra often produce results that are difficult to read and execute. Motivated by the advancements in Large Language Models (LLMs), we propose LLM4Decompile, the first and largest open-source LLM series (1.3B to 33B) trained to decompile binary code. We optimize the LLM training process and introduce the LLM4Decompile-End models to decompile binary directly. The resulting models significantly outperform GPT-4o and Ghidra on the HumanEval and ExeBench benchmarks by over 100% in terms of re-executability rate. Additionally, we improve the standard refinement approach to fine-tune the LLM4Decompile-Ref models, enabling them to effectively refine the decompiled code from Ghidra and achieve a further 16.2% improvement over the LLM4Decompile-End. LLM4Decompile demonstrates the potential of LLMs to revolutionize binary code decompilation, delivering remarkable improvements in readability and executability while complementing conventional tools for optimal results. Our code, dataset, and models are released at https://github.com/albertan017/LLM4Decompile
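As a rough illustration of how a figure such as "over 100% in terms of re-executability rate" can be read, the sketch below assumes that the re-executability rate is the fraction of decompiled programs that recompile and pass their test cases, and that the reported improvement is relative to the baseline. The function names and the pass/fail lists are hypothetical and are not taken from the paper's code or results.

```python
from typing import Sequence


def re_executability_rate(passed: Sequence[bool]) -> float:
    """Fraction of samples whose decompiled code recompiled and passed its tests.

    Assumption: each entry is True when one decompiled program re-executes correctly.
    """
    if not passed:
        raise ValueError("need at least one sample")
    return sum(passed) / len(passed)


def relative_improvement(baseline: float, candidate: float) -> float:
    """Relative gain of candidate over baseline, in percent."""
    if baseline <= 0:
        raise ValueError("baseline rate must be positive")
    return (candidate - baseline) / baseline * 100.0


# Made-up outcomes for illustration only (not results from the paper).
baseline_outcomes = [True, False, False, False, False]   # 1 of 5 pass
candidate_outcomes = [True, True, True, False, False]    # 3 of 5 pass

baseline_rate = re_executability_rate(baseline_outcomes)
candidate_rate = re_executability_rate(candidate_outcomes)
gain = relative_improvement(baseline_rate, candidate_rate)
print(f"{baseline_rate:.2f} -> {candidate_rate:.2f} ({gain:.0f}% relative improvement)")
```

Under this reading, any relative improvement above 100% means the candidate's rate is more than double the baseline's; in the made-up example, moving from one passing sample in five to three in five is a 200% relative gain.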

Code Implementations: 1 repo
