SE
Jul 23, 2021

Ensemble Models for Neural Source Code Summarization of Subroutines

Alexander LeClair, Aakash Bansal, Collin McMillan

arXiv:2107.11423v1
19.234 citations

Originality: Incremental advance

AI Analysis

This work addresses the need for better documentation generation for programmers by providing an incremental improvement over current neural summarization techniques.

The paper improves neural source code summarization by combining existing models in ensembles that exploit the orthogonal performance differences among them, boosting performance by up to 14.8%.

A source code summary of a subroutine is a brief description of that subroutine. Summaries underpin a majority of documentation consumed by programmers, such as the method summaries in JavaDocs. Source code summarization is the task of writing these summaries. At present, most state-of-the-art approaches for code summarization are neural network-based solutions akin to seq2seq, graph2seq, and other encoder-decoder architectures. The input to the encoder is source code, while the decoder helps predict the natural language summary. While these models tend to be similar in structure, evidence is emerging that different models make different contributions to prediction quality -- differences in model performance are orthogonal and complementary rather than uniform over the entire dataset. In this paper, we explore the orthogonal nature of different neural code summarization approaches and propose ensemble models to exploit this orthogonality for better overall performance. We demonstrate that a simple ensemble strategy boosts performance by up to 14.8%, and provide an explanation for this boost. The takeaway from this work is that a relatively small change to the inference procedure in most neural code summarization techniques leads to outsized improvements in prediction quality.
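
The abstract does not spell out the ensemble strategy, so the following is only a minimal sketch of one common approach: at each decoding step, average the next-token probability distributions from several models and pick the most likely token. The `Model` interface, `table_model`, `ensemble_step`, `greedy_summarize`, and the toy vocabulary are all hypothetical names introduced for illustration; they are not taken from the paper.

```python
from typing import Callable, List, Sequence

# Assumed interface (not from the paper): a model maps the source code tokens
# and the summary prefix generated so far to a probability distribution over
# the summary vocabulary for the next token.
Model = Callable[[Sequence[str], Sequence[int]], List[float]]


def table_model(rows: List[List[float]]) -> Model:
    """Toy stand-in for a trained model: one fixed distribution per step."""
    def predict(code: Sequence[str], prefix: Sequence[int]) -> List[float]:
        return rows[min(len(prefix), len(rows) - 1)]
    return predict


def ensemble_step(models: Sequence[Model], code: Sequence[str],
                  prefix: Sequence[int]) -> List[float]:
    """Average the next-token distributions of all models (uniform weights)."""
    dists = [m(code, prefix) for m in models]
    return [sum(column) / len(dists) for column in zip(*dists)]


def greedy_summarize(models: Sequence[Model], code: Sequence[str],
                     eos_id: int, max_len: int = 13) -> List[int]:
    """Greedy decoding in which only the per-step prediction is ensembled."""
    prefix: List[int] = []
    for _ in range(max_len):
        probs = ensemble_step(models, code, prefix)
        next_id = max(range(len(probs)), key=probs.__getitem__)
        if next_id == eos_id:
            break
        prefix.append(next_id)
    return prefix


# Toy vocabulary: 0 = <eos>, 1 = "returns", 2 = "sets", 3 = "name".
model_a = table_model([[0.0, 0.6, 0.4, 0.0],
                       [0.1, 0.0, 0.0, 0.9],
                       [1.0, 0.0, 0.0, 0.0]])
model_b = table_model([[0.0, 0.3, 0.7, 0.0],
                       [0.2, 0.0, 0.0, 0.8],
                       [1.0, 0.0, 0.0, 0.0]])

code_tokens = ["void", "setName", "(", "String", "n", ")"]
alone = greedy_summarize([model_a], code_tokens, eos_id=0)
combined = greedy_summarize([model_a, model_b], code_tokens, eos_id=0)
```

In this toy setup the two models disagree on the first word, and the averaged distribution follows the more confident one. Training is untouched; only the inference loop changes, which matches the abstract's point that the modification is small.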
