SE IR PL
Jul 26, 2019

Scalable Source Code Similarity Detection in Large Code Repositories

arXiv:1907.11817v1
4 citations

Originality: Incremental advance

AI Analysis

This work addresses the problem of scalable code similarity detection for software developers and maintainers, offering an incremental improvement over existing methods.

The paper tackles the problem of automatically detecting similar code fragments in large repositories to address issues like bug propagation and maintenance overhead, presenting an approach based on control flow graph fingerprinting that shows effectiveness and efficiency in experiments compared to other solutions.

Source code similarity detection is increasingly used in application development to identify clones, isolate bugs, and find copyright violations. Similar code fragments can be very problematic because errors in the original code must be fixed in every copy. Other maintenance changes, such as extensions or patches, must likewise be applied multiple times. Furthermore, the diversity of coding styles and the flexibility of modern languages make it difficult and cost-ineffective to manually inspect large code repositories. Therefore, detection is only feasible with automatic techniques. We present an efficient and scalable approach for identifying similar code fragments based on fingerprinting of source code control flow graphs. The source code is processed to generate control flow graphs, which are then hashed to create a unique fingerprint of the code that captures semantic as well as syntactic similarity. The fingerprints can then be efficiently stored and retrieved to perform similarity search between code fragments. Experimental results from our prototype implementation support the validity of our approach and show its effectiveness and efficiency in comparison with other solutions.
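
The pipeline described in the abstract (build a control flow graph, hash it into a fingerprint, store fingerprints for lookup) can be illustrated with a minimal sketch. The abstract does not say which hashing scheme the authors use, so this example substitutes a Weisfeiler-Lehman-style label refinement as an assumption. The names `cfg_fingerprint` and `FingerprintIndex`, the node kinds, and the hand-written graphs are all hypothetical, and the control flow graphs are given directly as input rather than extracted from source code.

```python
import hashlib
from collections import defaultdict


def _digest(text):
    # Short, stable hash of a string (truncated SHA-256).
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


def cfg_fingerprint(labels, edges, rounds=3):
    """Hash a control flow graph into a single fingerprint string.

    labels: dict mapping node id -> node kind (e.g. "entry", "branch").
    edges:  iterable of (source, destination) node id pairs.
    Assumption: Weisfeiler-Lehman-style refinement stands in for the
    paper's hashing scheme, which the abstract does not specify.
    """
    successors = defaultdict(list)
    for src, dst in edges:
        successors[src].append(dst)

    # Start from the node kinds, so node ids never influence the result.
    current = {node: _digest(kind) for node, kind in labels.items()}
    for _ in range(rounds):
        # Each round mixes a node's hash with its successors' hashes.
        current = {
            node: _digest(
                current[node]
                + "|"
                + ",".join(sorted(current[s] for s in successors[node]))
            )
            for node in labels
        }
    # Combine node hashes in sorted order into one graph-level fingerprint.
    return _digest(",".join(sorted(current.values())))


class FingerprintIndex:
    """Hypothetical in-memory store: fingerprint -> fragment names."""

    def __init__(self):
        self._buckets = defaultdict(list)

    def add(self, name, labels, edges):
        self._buckets[cfg_fingerprint(labels, edges)].append(name)

    def lookup(self, labels, edges):
        return list(self._buckets.get(cfg_fingerprint(labels, edges), []))


# Two if/else fragments with the same control flow but different node ids.
if_else_a = (
    {0: "entry", 1: "branch", 2: "stmt", 3: "stmt", 4: "exit"},
    [(0, 1), (1, 2), (1, 3), (2, 4), (3, 4)],
)
if_else_b = (
    {"s": "entry", "c": "branch", "t": "stmt", "f": "stmt", "e": "exit"},
    [("s", "c"), ("c", "t"), ("c", "f"), ("t", "e"), ("f", "e")],
)
# A while loop: the body jumps back to the branch node.
loop = (
    {0: "entry", 1: "branch", 2: "stmt", 3: "exit"},
    [(0, 1), (1, 2), (2, 1), (1, 3)],
)

index = FingerprintIndex()
index.add("fragment_a", *if_else_a)
index.add("fragment_loop", *loop)
matches = index.lookup(*if_else_b)
```

In this sketch, `matches` contains only the fragment whose control flow graph has the same shape as the query, regardless of how its nodes are named. It detects structurally identical graphs only; a similarity search that tolerates small structural differences, as the abstract suggests, would need a more forgiving fingerprint or comparison than a single exact-match hash.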
