Source Code is a Graph, Not a Sequence: A Cross-Lingual Perspective on Code Clone Detection
The paper addresses the problem of detecting similar code fragments across different programming languages for software maintenance, but its contribution is incremental because it only compares existing models.
The paper tackled code clone detection by arguing that source code is a graph rather than a sequence, and showed that a graph-based model (CodeGraph) outperformed a sequence-based model (CodeBERT) on benchmark datasets, particularly for cross-lingual clones.
Code clone detection is the task of finding code fragments that have the same or similar functionality but may differ in syntax or structure. This task is important for software maintenance, reuse, and quality assurance (Roy et al. 2009). However, code clone detection is challenging because source code can be written in different languages, domains, and styles. In this paper, we argue that source code is inherently a graph, not a sequence, and that graph-based methods are therefore better suited to code clone detection than sequence-based methods. We compare the performance of two state-of-the-art models, CodeBERT (Feng et al. 2020), a sequence-based model, and CodeGraph (Yu et al. 2023), a graph-based model, on two benchmark datasets: BCB (Svajlenko et al. 2014) and PoolC (PoolC n.d.). We show that CodeGraph outperforms CodeBERT on both datasets, especially on cross-lingual code clones. To the best of our knowledge, this is the first work to demonstrate the superiority of graph-based methods over sequence-based methods on cross-lingual code clone detection.
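To make the sequence-versus-graph distinction concrete, the following minimal Python sketch contrasts the two views on a pair of fragments that differ only in identifier names. It is an illustration under stated assumptions, not the representation used by CodeBERT or CodeGraph: the helper names `token_sequence` and `ast_edges` are hypothetical, the "graph" is just the abstract syntax tree produced by Python's standard `ast` module, and real graph-based models may add further edges such as control flow or data flow.

```python
import ast
import io
import tokenize

# Token types that carry layout only and are dropped from the flat view.
_LAYOUT = {
    tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
    tokenize.DEDENT, tokenize.ENDMARKER, tokenize.COMMENT,
}


def token_sequence(source: str) -> list[str]:
    """Sequence view (hypothetical helper): lexical tokens in source order."""
    readline = io.StringIO(source).readline
    return [
        tok.string
        for tok in tokenize.generate_tokens(readline)
        if tok.type not in _LAYOUT
    ]


def ast_edges(source: str) -> list[tuple[str, str]]:
    """Graph view (hypothetical helper): (parent, child) node-type pairs."""
    tree = ast.parse(source)
    edges = []
    for parent in ast.walk(tree):
        for child in ast.iter_child_nodes(parent):
            edges.append((type(parent).__name__, type(child).__name__))
    return edges


# Two fragments with the same behavior but different identifiers.
original = "def add(a, b):\n    return a + b\n"
renamed = "def plus(x, y):\n    return x + y\n"

same_tokens = token_sequence(original) == token_sequence(renamed)
same_structure = ast_edges(original) == ast_edges(renamed)
print("identical token sequences:", same_tokens)
print("identical AST edge lists: ", same_structure)
```

In this toy case the token sequences differ because the identifiers differ, while the node-type edges of the two syntax trees coincide. This only illustrates the intuition for renamed clones within one language; it does not reproduce the cross-lingual setting or the results reported in the paper.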