Deep Data Flow Analysis
This work matters to compiler engineers because it aims to let machine learning models perform data flow analysis, a critical component of automatic heuristic design in compilers.
This paper addresses the inability of most machine learning methods to replicate the data flow analyses that compilers rely on for optimization. The authors introduce ProGraML, a language-independent representation of whole-program semantics, and an open dataset of 461k LLVM IR files with 15.4M corresponding data flow results. Using ProGraML, they show that standard data flow analyses can be learned by message passing neural networks (MPNNs), which improves performance on downstream compiler optimization tasks.
Compiler architects increasingly look to machine learning when building heuristics for compiler optimization. The promise of automatic heuristic design, freeing the compiler engineer from the complex interactions of program, architecture, and other optimizations, is alluring. However, most machine learning methods cannot replicate even the simplest of the abstract interpretations of data flow analysis that are critical to making good optimization decisions. This must change for machine learning to become the dominant technology in compiler heuristics. To this end, we propose ProGraML (Program Graphs for Machine Learning), a language-independent, portable representation of whole-program semantics for deep learning. To benchmark current and future learning techniques for compiler analyses, we introduce an open dataset of 461k Intermediate Representation (IR) files for LLVM, covering five source programming languages, and 15.4M corresponding data flow results. We formulate data flow analysis as a message passing neural network (MPNN) and show that, using ProGraML, standard analyses can be learned, yielding improved performance on downstream compiler optimization tasks.
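To make the link between data flow analysis and message passing concrete, the following is a minimal sketch of a classic analysis (forward reachability over a control flow graph) written as repeated rounds of "send a message along each edge, then update each node". It is a hand-written, non-learned illustration only: it does not use the ProGraML representation or API, and the graph `EXAMPLE_CFG`, the function `reachable_from`, and the node names are hypothetical. An MPNN would replace the fixed `max` message and update rules with learned functions.

```python
from typing import Dict, List, Set


def reachable_from(cfg: Dict[str, List[str]], root: str) -> Set[str]:
    """Forward reachability computed as message passing to a fixed point.

    `cfg` maps each node to its successors; every successor must also be a key.
    The root counts as reachable from itself (an assumption of this sketch).
    """
    # Node state: 1 if the node is known to be reachable from `root`, else 0.
    state = {node: 0 for node in cfg}
    state[root] = 1

    changed = True
    while changed:
        changed = False
        # Message step: each node sends its current state along outgoing edges;
        # incoming messages are aggregated with max (logical OR on 0/1 values).
        messages = {node: 0 for node in cfg}
        for src, successors in cfg.items():
            for dst in successors:
                messages[dst] = max(messages[dst], state[src])
        # Update step: combine the old state with the aggregated message.
        for node in cfg:
            new_state = max(state[node], messages[node])
            if new_state != state[node]:
                state[node] = new_state
                changed = True

    return {node for node, value in state.items() if value == 1}


# Hypothetical control flow graph: a loop plus one unreachable block.
EXAMPLE_CFG = {
    "entry": ["loop"],
    "loop": ["body", "exit"],
    "body": ["loop"],
    "exit": [],
    "dead": ["exit"],
}

if __name__ == "__main__":
    print(sorted(reachable_from(EXAMPLE_CFG, "entry")))
```

The loop runs until no node state changes, which mirrors how iterative data flow solvers reach a fixed point. The number of rounds needed grows with the longest path in the graph, so a message passing model that runs a fixed number of steps can only propagate information that far.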