IR, Jan 30, 2022

Similarity Search on Computational Notebooks

arXiv:2201.12786v1, 1 citation
Originality: Incremental advance
AI Analysis

This work addresses the tedious task data scientists face when searching for computational notebooks, but it is an incremental advance because it builds on existing similarity search techniques.

The paper tackles the problem of manually searching for reusable computational notebooks by proposing a similarity search framework that uses set-based and graph-based measures to find top-k notebooks with similar contents. Experiments on Kaggle notebooks show that the graph-based similarity method achieves high accuracy and efficiency.

Computational notebook software such as Jupyter Notebook is popular for data science tasks. Numerous computational notebooks are available on the Web and are reusable; however, searching for them manually is tedious, and so far there are no tools to search for computational notebooks effectively and efficiently. In this paper, we propose a similarity search on computational notebooks and develop a new framework for it. Given contents (i.e., source code, tabular data, libraries, and output formats) in computational notebooks as a query, the similarity search problem aims to find the top-k computational notebooks with the most similar contents. We define two similarity measures: set-based and graph-based similarity. Set-based similarity handles each content independently, while graph-based similarity captures the relationships between contents. Our framework can effectively prune candidate notebooks that cannot be in the top-k results. Furthermore, we develop optimization techniques such as caching and indexing to accelerate the search. Experiments using Kaggle notebooks show that our method, in particular graph-based similarity, can achieve high accuracy and high efficiency.
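
To make the set-based idea concrete, here is a minimal sketch of a top-k search in which each content type is compared independently. It is not the paper's method: the abstract does not specify the similarity formulas, so the Jaccard measure, the equal weighting, the content-type keys, the function names, and the toy notebooks below are all assumptions made for illustration. Pruning, caching, indexing, and the graph-based measure are not shown.

```python
import heapq

# Hypothetical content-type keys, mirroring the four kinds of content named
# in the abstract (source code, tabular data, libraries, output formats).
CONTENT_TYPES = ("code", "data", "libraries", "outputs")


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def set_similarity(query, notebook, weights=None):
    """Weighted average of per-type Jaccard scores (an assumed measure).

    Each content type is scored independently of the others, which is the
    defining property of a set-based similarity.
    """
    weights = weights or {t: 1.0 for t in CONTENT_TYPES}
    total = sum(weights.values())
    score = sum(
        weights[t] * jaccard(query.get(t, set()), notebook.get(t, set()))
        for t in CONTENT_TYPES
    )
    return score / total


def top_k(query, notebooks, k):
    """Return the k (name, score) pairs with the highest similarity."""
    scored = ((name, set_similarity(query, nb)) for name, nb in notebooks.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy corpus: every name and value here is invented for the example.
notebooks = {
    "nb_a": {"code": {"read_csv", "fit"}, "data": {"age", "income"},
             "libraries": {"pandas", "sklearn"}, "outputs": {"table"}},
    "nb_b": {"code": {"read_csv", "plot"}, "data": {"age"},
             "libraries": {"pandas", "matplotlib"}, "outputs": {"png"}},
    "nb_c": {"code": {"train"}, "data": {"pixels"},
             "libraries": {"torch"}, "outputs": {"text"}},
}
query = {"code": {"read_csv", "fit"}, "data": {"age"},
         "libraries": {"pandas"}, "outputs": {"table"}}

results = top_k(query, notebooks, k=2)
print(results)  # nb_a ranks first (score 0.75), then nb_b
```

This exhaustive scan scores every notebook, which is the cost that the pruning, caching, and indexing described in the abstract are meant to avoid. A graph-based measure would additionally compare how the contents relate to one another rather than treating each set in isolation.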

