AI SE | Jun 17, 2025

AST-Enhanced or AST-Overloaded? The Surprising Impact of Hybrid Graph Representations on Code Clone Detection

arXiv:2506.14470v1 | 7.82 citations | h-index: 3 | Has Code | ICSME

Originality: Incremental advance

AI Analysis

This paper addresses the critical challenge of detecting code clones to reduce software maintenance costs and vulnerability risks. Its contribution is incremental: it builds on existing AST-based methods by empirically comparing hybrid representations.

This paper tackles the problem of code clone detection by evaluating hybrid graph representations that combine Abstract Syntax Trees (ASTs) with semantic graphs, such as Control Flow Graphs (CFGs) and Data Flow Graphs (DFGs), using Graph Neural Networks (GNNs). The results show that AST+CFG+DFG enhances accuracy for some GNN models, while Flow-Augmented ASTs (FA-AST) often harm performance, and the Graph Matching Network (GMN) outperforms the other models even with standard ASTs.

As one of the most detrimental code smells, code clones significantly increase software maintenance costs and heighten vulnerability risks, making their detection a critical challenge in software engineering. Abstract Syntax Trees (ASTs) dominate deep learning-based code clone detection due to their precise syntactic structure representation, but they inherently lack semantic depth. Recent studies address this by enriching AST-based representations with semantic graphs, such as Control Flow Graphs (CFGs) and Data Flow Graphs (DFGs). However, the effectiveness of various enriched AST-based representations and their compatibility with different graph-based machine learning techniques remain open questions, warranting further investigation to unlock their full potential in addressing the complexities of code clone detection. In this paper, we present a comprehensive empirical study to rigorously evaluate the effectiveness of AST-based hybrid graph representations in Graph Neural Network (GNN)-based code clone detection. We systematically compare various hybrid representations (CFG, DFG, Flow-Augmented ASTs (FA-AST)) across multiple GNN architectures. Our experiments reveal that hybrid representations impact GNNs differently: while AST+CFG+DFG consistently enhances accuracy for convolution- and attention-based models (Graph Convolutional Networks (GCN), Graph Attention Networks (GAT)), FA-AST frequently introduces structural complexity that harms performance. Notably, the Graph Matching Network (GMN) outperforms others even with standard AST representations, highlighting its superior cross-code similarity detection and reducing the need for enriched structures.
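
To make the idea of an "AST+CFG+DFG" hybrid graph concrete, the following is a minimal sketch in Python. It is not the paper's extraction pipeline, which this listing does not describe. The function name `build_hybrid_graph`, the edge labels `"ast"`, `"cfg"`, and `"dfg"`, and the two simplifications (control flow as links between consecutive statements, data flow as the latest write of a name linked to each later read) are assumptions made purely for illustration.

```python
import ast
from collections import Counter

# Node types that CPython may share as singletons across the tree; they carry
# no structure of their own, so this sketch leaves them out of the graph.
_SKIP = (ast.expr_context, ast.operator, ast.boolop, ast.unaryop, ast.cmpop)


def build_hybrid_graph(source):
    """Return (nodes, edges); each edge is a (src_index, dst_index, kind) triple."""
    tree = ast.parse(source)
    nodes = [n for n in ast.walk(tree) if not isinstance(n, _SKIP)]
    index = {id(n): i for i, n in enumerate(nodes)}
    edges = []

    # Syntactic layer: one "ast" edge per parent-child pair.
    for parent in nodes:
        for child in ast.iter_child_nodes(parent):
            if id(child) in index:
                edges.append((index[id(parent)], index[id(child)], "ast"))

    # Control-flow layer (simplified assumption): link consecutive statements
    # in the same block. Branches and loop back-edges are not modeled.
    for node in nodes:
        for field in ("body", "orelse", "finalbody"):
            stmts = getattr(node, field, None)
            if isinstance(stmts, list):
                for first, second in zip(stmts, stmts[1:]):
                    edges.append((index[id(first)], index[id(second)], "cfg"))

    # Data-flow layer (simplified assumption): link the latest write of a name
    # to each later read. Reads on a line are processed before writes on that
    # line, so `x = x + 1` reads the previous value of x.
    names = sorted(
        (n for n in nodes if isinstance(n, ast.Name)),
        key=lambda n: (n.lineno, isinstance(n.ctx, ast.Store), n.col_offset),
    )
    last_write = {}
    for name in names:
        if isinstance(name.ctx, ast.Store):
            last_write[name.id] = name
        elif isinstance(name.ctx, ast.Load) and name.id in last_write:
            edges.append((index[id(last_write[name.id])], index[id(name)], "dfg"))

    return nodes, edges


if __name__ == "__main__":
    nodes, edges = build_hybrid_graph("x = 1\ny = x + 2\nprint(y)\n")
    # Count how many edges each layer contributes to the hybrid graph.
    print(Counter(kind for _, _, kind in edges))
```

Even in this toy form, the semantic layers only add edges on top of the same AST nodes, so the graph a GNN has to process grows denser with each layer. That is one plausible reading of the abstract's remark that enriched representations can introduce structural complexity.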
