SE · Sep 19, 2019

DIRE: A Neural Approach to Decompiled Identifier Naming

Jeremy Lacomis, Pengcheng Yin, Edward J. Schwartz, Miltiadis Allamanis, Claire Le Goues, Graham Neubig, Bogdan Vasilescu

arXiv:1909.09029v230.7142 citations

Originality: Incremental advance

AI Analysis

The paper addresses code understandability for reverse engineers and security analysts, but it is an incremental advance because it builds on existing decompilation techniques.

The paper tackles the problem of recovering meaningful variable names in decompiled code, which decompilers do not reconstruct. It shows that DIRE predicts names identical to those in the original source code up to 74.3% of the time on a corpus of 164,632 unique x86-64 binaries.

The decompiler is one of the most common tools for examining binaries without corresponding source code. It transforms binaries into high-level code, reversing the compilation process. Decompilers can reconstruct much of the information that is lost during the compilation process (e.g., structure and type information). Unfortunately, they do not reconstruct semantically meaningful variable names, which are known to increase code understandability. We propose the Decompiled Identifier Renaming Engine (DIRE), a novel probabilistic technique for variable name recovery that uses both lexical and structural information recovered by the decompiler. We also present a technique for generating corpora suitable for training and evaluating models of decompiled code renaming, which we use to create a corpus of 164,632 unique x86-64 binaries generated from C projects mined from GitHub. Our results show that on this corpus DIRE can predict variable names identical to the names in the original source code up to 74.3% of the time.
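
The headline result is an exact-match rate: a prediction counts only if it is identical to the name in the original source. A minimal sketch of how such a rate could be computed is shown below. The function name and the sample variable names are hypothetical and not taken from the paper, and the paper's actual evaluation procedure may differ.

```python
def exact_match_accuracy(predicted, original):
    """Return the fraction of variables whose predicted name equals the original name."""
    if len(predicted) != len(original):
        raise ValueError("predicted and original must have the same length")
    if not original:
        return 0.0  # assumption: define accuracy over an empty set as 0
    matches = sum(p == o for p, o in zip(predicted, original))
    return matches / len(original)


# Hypothetical example: three decompiled variables, two of them recovered exactly.
predicted_names = ["buf", "len", "i"]
original_names = ["buf", "length", "i"]
accuracy = exact_match_accuracy(predicted_names, original_names)
```

Exact match is a strict criterion: a near-synonym such as `len` for `length` scores as a miss, so the reported rate is a conservative measure of how useful the predicted names are.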
