Neural architectures for resolving references in program code
For researchers in program analysis and decompilation, the paper addresses a specific bottleneck in reference rewriting with a novel architecture that yields strong practical gains.
The paper abstracts reference rewriting in programming languages into direct and indirect indexing tasks, creates synthetic benchmarks, and proposes new sequence-to-sequence architectures that outperform baselines in robustness and scalability, handling examples ten times longer. In a real-world decompilation task, the model reduces error rate by 42%.
Resolving and rewriting references is fundamental in programming languages. Motivated by a real-world decompilation task, we abstract reference rewriting into the problems of direct and indirect indexing by permutation. We create synthetic benchmarks for these tasks and show that well-known sequence-to-sequence machine learning architectures are struggling on these benchmarks. We introduce new sequence-to-sequence architectures for both problems. Our measurements show that our architectures outperform the baselines in both robustness and scalability: our models can handle examples that are ten times longer compared to the best baseline. We measure the impact of our architecture in the real-world task of decompiling switch statements, which has an indexing subtask. According to our measurements, the extended model decreases the error rate by 42%. Multiple ablation studies show that all components of our architectures are essential.