Structured Generative Models of Natural Source Code
This work addresses the challenge of generating human-readable source code. The contribution is incremental: it builds on existing generative modeling approaches by adding structural elements.
The paper builds generative models of natural source code by introducing a family of models that incorporate sequential and hierarchical structure, learn distributed representations of source code elements, and integrate with a compiler. It shows that including appropriate structure greatly improves the models, as measured by the probability they assign to test programs.
We study the problem of building generative models of natural source code (NSC); that is, source code written and understood by humans. Our primary contribution is to describe a family of generative models for NSC that have three key properties. First, they incorporate both sequential and hierarchical structure. Second, they learn a distributed representation of source code elements. Third, they integrate closely with a compiler, which allows leveraging compiler logic and abstractions when building structure into the model. We also develop an extension that includes more complex structure, refining how the model generates identifier tokens based on which variables are currently in scope. Our models can be learned efficiently, and we show empirically that including appropriate structure greatly improves the models, as measured by the probability of generating test programs.
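To make the evaluation criterion concrete, the following is a minimal sketch of scoring a held-out program by its log-probability under a model that uses hierarchical structure. It is not the family of models described above. It is a simple count-based model of which child node types appear under each parent node type, with additive smoothing. It assumes Python source and uses the standard-library `ast` module as a stand-in for the compiler. The names `productions` and `ProductionModel` and the smoothing parameter `alpha` are hypothetical and chosen only for illustration.

```python
import ast
import math
from collections import Counter, defaultdict


def productions(source):
    """Yield (parent_type, child_types) for every node in the program's AST."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        children = tuple(type(c).__name__ for c in ast.iter_child_nodes(node))
        yield type(node).__name__, children


class ProductionModel:
    """Count-based model of P(children | parent) with additive smoothing.

    Illustrative assumption: all unseen child tuples share one extra
    smoothing slot, so this is a rough baseline rather than a carefully
    normalized distribution over every possible production.
    """

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.counts = defaultdict(Counter)  # parent type -> child-tuple counts
        self.vocab = set()                  # all child tuples seen in training

    def fit(self, sources):
        for src in sources:
            for parent, children in productions(src):
                self.counts[parent][children] += 1
                self.vocab.add(children)
        return self

    def log_prob(self, source):
        """Sum of smoothed log-probabilities of every production in `source`."""
        slots = len(self.vocab) + 1  # +1 slot reserved for unseen productions
        total = 0.0
        for parent, children in productions(source):
            seen = self.counts.get(parent, Counter())
            numerator = seen[children] + self.alpha
            denominator = sum(seen.values()) + self.alpha * slots
            total += math.log(numerator / denominator)
        return total


# Toy training set and two held-out programs.
train = ["x = 1\ny = x + 2\n", "a = 3\nb = a + 4\n"]
model = ProductionModel().fit(train)
similar = "c = 5\nd = c + 6\n"
different = "for i in range(3):\n    print(i)\n"
print(model.log_prob(similar), model.log_prob(different))
```

In this toy setting the held-out program that shares tree structure with the training data receives the higher log-probability. That is the sense in which test program probability can serve as a measure of model quality. A model closer to the one described in the abstract would replace the count tables with learned distributed representations and would condition identifier choices on the variables in scope.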