SE | LG | PL | Jan 19, 2021

Improving type information inferred by decompilers with supervised machine learning

arXiv:2101.08116v2 | 2 citations | Has Code
Originality: Incremental advance
AI Analysis

This work addresses the challenge of making decompiled code more interpretable for reverse engineers, representing an incremental improvement over current decompilation techniques.

The paper uses supervised machine learning models to predict function return types, improving on the type information that decompilers infer. It reports a 79.1% F1-measure, compared with 30% for the best existing decompiler.

In software reverse engineering, decompilation is the process of recovering source code from binary files. Decompilers are used when it is necessary to understand or analyze software for which the source code is not available. Although existing decompilers commonly obtain source code with the same behavior as the binaries, that source code is usually hard to interpret and certainly differs from the original code written by the programmer. Massive codebases could be used to build supervised machine learning models aimed at improving existing decompilers. In this article, we build different classification models capable of inferring the high-level type returned by functions, with significantly higher accuracy than existing decompilers. We automatically instrument C source code to allow the association of binary patterns with their corresponding high-level constructs. A dataset is created with a collection of real open-source applications plus a huge number of synthetic programs. Our system is able to predict function return types with a 79.1% F1-measure, whereas the best decompiler obtains a 30% F1-measure. Moreover, we document the binary patterns used by our classifier to allow their addition in the implementation of existing decompilers.
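The idea of associating binary patterns with source-level return types, and of scoring the result with an F1-measure, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the patterns, the toy samples, the majority-vote "classifier", and the choice of macro-averaging are not taken from the paper, which builds its own classification models on a much larger dataset. The only grounded intuition is that on x86-64 floating-point values are typically returned in `xmm0` and integers or pointers in `eax`/`rax`, so instructions near a function's return carry some signal about its type.

```python
from collections import Counter, defaultdict

# Hypothetical training samples: (pattern seen near the function's return,
# return type taken from the instrumented source). Illustrative only.
TRAIN = [
    ("movss xmm0", "float"),
    ("movsd xmm0", "double"),
    ("mov eax", "int"),
    ("mov eax", "int"),
    ("movzx eax", "bool"),
    ("mov rax", "pointer"),
    ("mov rax", "long"),
    ("mov rax", "pointer"),
]


def train(samples):
    """Map each pattern to the return type most often observed with it."""
    counts = defaultdict(Counter)
    for pattern, label in samples:
        counts[pattern][label] += 1
    return {p: c.most_common(1)[0][0] for p, c in counts.items()}


def predict(model, pattern, default="int"):
    """Look up the pattern; fall back to a default type for unseen patterns."""
    return model.get(pattern, default)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro-averaging is an assumption)."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)


model = train(TRAIN)

# Hypothetical held-out functions, including an ambiguous and an unseen pattern.
TEST = [
    ("mov rax", "long"),
    ("movss xmm0", "float"),
    ("mov eax", "int"),
    ("xor eax", "void"),
]
y_true = [label for _, label in TEST]
y_pred = [predict(model, pattern) for pattern, _ in TEST]
score = macro_f1(y_true, y_pred)
print(y_pred, round(score, 3))
```

The toy lookup fails exactly where a single pattern is ambiguous (`mov rax` can precede a `long` or a pointer return) or unseen, which hints at why richer features and a trained classifier are needed for this task.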
