Towards a Measure of Algorithm Similarity
This provides a pragmatic similarity metric for applications like clone detection or program synthesis, though it is incremental as it builds on existing equivalence and similarity notions.
The paper tackles the problem of measuring similarity between algorithms for the same problem, which is challenging due to uncomputability and competing notions, by introducing EMOC (Evaluation-Memory-Operations-Complexity) to embed algorithm implementations into a feature space. They compile the PACD dataset of Python implementations and show EMOC supports clustering, classification, near-duplicate detection, and diversity quantification in LLM-generated programs.
Given two algorithms for the same problem, can we determine whether they are meaningfully different? In full generality, the question is uncomputable, and empirically it is muddied by competing notions of similarity. Yet, in many applications (such as clone detection or program synthesis) a pragmatic and consistent similarity metric is necessary. We review existing equivalence and similarity notions and introduce EMOC: An Evaluation-Memory-Operations-Complexity framework that embeds algorithm implementations into a feature space suitable for downstream tasks. We compile PACD, a curated dataset of verified Python implementations across three problems, and show that EMOC features support clustering and classification of algorithm types, detection of near-duplicates, and quantification of diversity in LLM-generated programs. Code, data, and utilities for computing EMOC embeddings are released to facilitate reproducibility and future work on algorithm similarity.