SEAug 30, 2021

CAST: Enhancing Code Summarization with Hierarchical Splitting and Reconstruction of Abstract Syntax Trees

arXiv:2108.12987v2665 citations
Originality Incremental advance
AI Analysis

This work addresses code summarization for program comprehension and maintenance, presenting an incremental improvement over existing methods by enhancing AST utilization.

The paper tackles the problem of code summarization by proposing CAST, a model that hierarchically splits and reconstructs abstract syntax trees (ASTs) to better capture syntactic information, resulting in improved performance on benchmarks as demonstrated through extensive experiments.

Code summarization aims to generate concise natural language descriptions of source code, which can help improve program comprehension and maintenance. Recent studies show that syntactic and structural information extracted from abstract syntax trees (ASTs) is conducive to summary generation. However, existing approaches fail to fully capture the rich information in ASTs because of the large size/depth of ASTs. In this paper, we propose a novel model CAST that hierarchically splits and reconstructs ASTs. First, we hierarchically split a large AST into a set of subtrees and utilize a recursive neural network to encode the subtrees. Then, we aggregate the embeddings of subtrees by reconstructing the split ASTs to get the representation of the complete AST. Finally, AST representation, together with source code embedding obtained by a vanilla code token encoder, is used for code summarization. Extensive experiments, including the ablation study and the human evaluation, on benchmarks have demonstrated the power of CAST. To facilitate reproducibility, our code and data are available at https://anonymous.4open.science/r/CAST/.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes