CL, AI · Jul 7, 2017

A parallel corpus of Python functions and documentation strings for automated code documentation and code generation

arXiv:1707.02275v1 · 153 citations · Has Code
Originality: Synthesis-oriented
AI Analysis

This paper addresses the data bottleneck facing researchers who work on automated code documentation and on code generation from natural language.

The authors tackled the problem of limited parallel corpora for automated code documentation and generation by creating a large dataset of 100,000 Python functions with docstrings scraped from GitHub. They provided baseline neural machine translation results and released the dataset to stimulate further research.

Automated documentation of programming source code and automated code generation from natural language are challenging tasks of both practical and scientific interest. Progress in these areas has been limited by the low availability of parallel corpora of code and natural language descriptions, which tend to be small and constrained to specific domains. In this work we introduce a large and diverse parallel corpus of a hundred thousand Python functions with their documentation strings ("docstrings"), generated by scraping open source repositories on GitHub. We describe baseline results for the code documentation and code generation tasks obtained by neural machine translation. We also experiment with data augmentation techniques to further increase the amount of training data. We release our datasets and processing scripts in order to stimulate research in these areas.
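To make the idea of a function/docstring parallel corpus concrete, the following is a minimal sketch of how one such pair could be extracted from Python source. It is not the authors' released processing script. The helper name `extract_pairs` and the sample input are hypothetical, and it assumes Python 3.9+ for `ast.unparse`.

```python
import ast

def extract_pairs(source: str):
    """Return (code_without_docstring, docstring) for each documented function.

    Hypothetical helper for illustration only; the paper's own pipeline
    may differ in filtering, normalization, and tokenization.
    """
    tree = ast.parse(source)
    # Collect function nodes first so that editing bodies does not affect traversal.
    functions = [
        node for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]
    pairs = []
    for fn in functions:
        doc = ast.get_docstring(fn)  # cleaned docstring, or None if absent
        if not doc:
            continue  # undocumented functions contribute no parallel pair
        # Drop the docstring statement; keep the body syntactically valid.
        fn.body = fn.body[1:] or [ast.Pass()]
        pairs.append((ast.unparse(fn), doc))
    return pairs

# Made-up sample input: one documented and one undocumented function.
SAMPLE = '''
def add(a, b):
    """Return the sum of a and b."""
    return a + b

def undocumented(x):
    return x
'''

pairs = extract_pairs(SAMPLE)
for code, doc in pairs:
    print(doc)
    print(code)
```

Each resulting pair has a code side and a natural-language side, so it can serve as a training example in either direction: code to docstring for documentation, or docstring to code for generation.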

Code Implementations: 6 repos