SE, AI, PL | Jun 28, 2023

A systematic literature review on source code similarity measurement and clone detection: techniques, applications, and challenges

arXiv:2306.16171v1 | 81 citations | h-index: 45
Originality: Synthesis-oriented
AI Analysis

This review gives software engineering researchers and practitioners a comprehensive overview of techniques and applications in code similarity measurement. Its contribution is incremental, since it synthesizes existing work.

The paper reports a systematic literature review and meta-analysis of source code similarity measurement and clone detection. It analyzes 136 primary studies and identifies 80 tools, 12 datasets, and key challenges, including the lack of reliable datasets and the lack of hybrid methods.

Measuring and evaluating source code similarity is a fundamental software engineering activity that embraces a broad range of applications, including but not limited to code recommendation, duplicate code, plagiarism, malware, and smell detection. This paper presents a systematic literature review and meta-analysis on code similarity measurement and evaluation techniques to shed light on the existing approaches and their characteristics in different applications. We initially found over 10,000 articles by querying four digital libraries and ended up with 136 primary studies in the field. The studies were classified according to their methodology, programming languages, datasets, tools, and applications. A deep investigation reveals 80 software tools, working with eight different techniques on five application domains. Nearly 49% of the tools work on Java programs and 37% support C and C++, while there is no support for many programming languages. A noteworthy point was the existence of 12 datasets related to source code similarity measurement and duplicate code, of which only eight datasets were publicly accessible. The lack of reliable datasets, empirical evaluations, hybrid methods, and a focus on multi-paradigm languages are the main challenges in the field. Emerging applications of code similarity measurement concentrate on the development phase in addition to maintenance.
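To make "code similarity measurement" concrete, here is a minimal sketch of one simple way to score two code fragments: Jaccard similarity over token n-grams. This is an illustration only. The abstract does not name its eight techniques, so this should not be read as one of them or as the method of any surveyed tool. The function names, the regular-expression tokenizer, and the default n-gram size of 3 are all assumptions made for this example.

```python
import re
from typing import List, Set, Tuple

# Hypothetical, language-agnostic tokenizer: identifiers, integer literals,
# and any other single non-whitespace character (operators, punctuation).
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(source: str) -> List[str]:
    """Split source text into coarse tokens, ignoring whitespace."""
    return TOKEN_RE.findall(source)


def shingles(tokens: List[str], n: int = 3) -> Set[Tuple[str, ...]]:
    """Return the set of contiguous token n-grams (empty if too few tokens)."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard_similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard index of the two fragments' token n-gram sets, in [0.0, 1.0]."""
    sa, sb = shingles(tokenize(a), n), shingles(tokenize(b), n)
    if not sa and not sb:
        # Assumption: two fragments with no n-grams are treated as identical.
        return 1.0
    return len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    original = "int sum = a + b;"
    reformatted = "int   sum=a+b;"      # same tokens, different whitespace
    unrelated = "while (true) { }"
    print(jaccard_similarity(original, reformatted))
    print(jaccard_similarity(original, unrelated))
```

A token-level score like this ignores layout changes but is sensitive to renamed identifiers, which is one reason a single measure may not suit every application domain the review lists.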
