CL · PL · SE · Apr 12, 2024

Revisiting Code Similarity Evaluation with Abstract Syntax Tree Edit Distance

arXiv:2404.08817v2 · 41 citations · h-index: 14 · ACL
Originality: Incremental advance
AI Analysis

This work addresses code similarity evaluation for developers and researchers. Its contribution is incremental because it builds on existing AST-based methods.

The paper revisits code similarity evaluation by applying Abstract Syntax Tree (AST) edit distance across programming languages, showing that it captures code structure effectively and correlates highly with established metrics. It proposes an optimized, adaptable metric, an enhanced version of Tree Similarity of Edit Distance (TSED), that performs well across all tested languages.

This paper revisits recent code similarity evaluation metrics, particularly focusing on the application of Abstract Syntax Tree (AST) editing distance in diverse programming languages. In particular, we explore the usefulness of these metrics and compare them to traditional sequence similarity metrics. Our experiments showcase the effectiveness of AST editing distance in capturing intricate code structures, revealing a high correlation with established metrics. Furthermore, we explore the strengths and weaknesses of AST editing distance and prompt-based GPT similarity scores in comparison to BLEU score, execution match, and Jaccard Similarity. We propose, optimize, and publish an adaptable metric that demonstrates effectiveness across all tested languages, representing an enhanced version of Tree Similarity of Edit Distance (TSED).
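To make the idea of an AST-edit-distance similarity score concrete, here is a minimal sketch. It is not the authors' implementation. It assumes Python's built-in `ast` module as the parser, a plain memoized forest recursion with unit insert/delete/rename costs as the tree edit distance, and a normalization of the form max(1 - distance / max_nodes, 0). None of these choices is stated in the abstract above, and the helper names `to_tree`, `size`, `forest_dist`, and `tsed` are hypothetical.

```python
import ast
from functools import lru_cache


def to_tree(node):
    """Convert a Python ast node into a hashable (label, children) tuple.

    Only node types are used as labels, so identifier names and literal
    values are ignored (an assumption made to keep the sketch short).
    """
    children = tuple(to_tree(c) for c in ast.iter_child_nodes(node))
    return (type(node).__name__, children)


def size(tree):
    """Number of nodes in a (label, children) tree."""
    return 1 + sum(size(c) for c in tree[1])


@lru_cache(maxsize=None)
def forest_dist(f1, f2):
    """Unit-cost edit distance between two ordered forests (tuples of trees)."""
    if not f1 and not f2:
        return 0
    if not f1:
        return sum(size(t) for t in f2)  # insert everything in f2
    if not f2:
        return sum(size(t) for t in f1)  # delete everything in f1
    (label1, kids1), (label2, kids2) = f1[-1], f2[-1]
    return min(
        # delete the root of the rightmost tree in f1; its children move up
        forest_dist(f1[:-1] + kids1, f2) + 1,
        # insert the root of the rightmost tree in f2
        forest_dist(f1, f2[:-1] + kids2) + 1,
        # match the two rightmost roots, renaming if the labels differ
        forest_dist(kids1, kids2)
        + forest_dist(f1[:-1], f2[:-1])
        + (label1 != label2),
    )


def tsed(code_a, code_b):
    """Similarity in [0, 1]: 1 - distance / max node count, clipped at 0.

    The normalization is an assumption of this sketch.
    """
    t1 = to_tree(ast.parse(code_a))
    t2 = to_tree(ast.parse(code_b))
    delta = forest_dist((t1,), (t2,))
    return max(1 - delta / max(size(t1), size(t2)), 0.0)


if __name__ == "__main__":
    print(tsed("x = 1", "y = 2"))      # same structure, different names
    print(tsed("x = 1", "x = 1 + 2"))  # extra expression nodes
```

Because labels are node types only, two snippets that differ just in variable names or constants score as identical here. A sequence metric such as BLEU or token-level Jaccard similarity would treat them as different, which illustrates the kind of contrast between structural and sequence metrics that the abstract describes. The recursion is adequate for short snippets only; larger inputs would call for a dedicated tree edit distance algorithm.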

