ReSplit: Improving the Structure of Jupyter Notebooks by Re-Splitting Their Cells
This work addresses the problem of unstructured code faced by data scientists and programmers who use Jupyter notebooks, but the contribution is incremental because it builds on existing analyses of notebook cells.
The authors tackled the problem of poor structure in Jupyter notebooks by developing ReSplit, an algorithm that automatically re-splits cells based on definition-usage chains, and found that in 29.5% of cases, human experts preferred the re-split version over the original.
Jupyter notebooks represent a unique format for programming: a combination of code and Markdown with rich formatting, separated into individual cells. We propose viewing a Jupyter notebook cell as a simplified, raw version of a programming function. Like functions, Jupyter cells should strive to contain single, self-contained actions. However, research shows that real-world notebooks fail to do so and suffer from a lack of proper structure. To combat this, we propose ReSplit, an algorithm for automatic re-splitting of cells in Jupyter notebooks. The algorithm analyzes definition-usage chains in the notebook and consists of two parts: merging cells and splitting cells. We ran the algorithm on a large corpus of notebooks to evaluate its performance and its overall effect on notebooks, and we also evaluated it with human experts: we showed them several notebooks in their original and re-split forms. In 29.5% of cases, the re-split notebook was selected as the preferred way of perceiving the code. We analyze what influenced this decision and describe several individual cases in detail.
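To make the idea of merging and splitting along definition-usage chains concrete, here is a minimal Python sketch. It is not the ReSplit algorithm itself: the abstract does not state the exact merge and split criteria, so the rules below are assumptions chosen for illustration (merge two adjacent cells when the second uses a name defined in the first; split a cell where a statement uses nothing defined earlier in the current group). The function names `defs_and_uses`, `merge_adjacent`, and `split_cell` are hypothetical.

```python
import ast


def defs_and_uses(source):
    """Return (defined names, used names) for a piece of cell source."""
    defined, used = set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                defined.add(node.id)
            elif isinstance(node.ctx, ast.Load):
                used.add(node.id)
        elif isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            defined.add(node.name)
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                # "import a.b" binds the top-level name "a"
                defined.add((alias.asname or alias.name).split(".")[0])
    return defined, used


def merge_adjacent(cells):
    """Assumed rule: merge a cell into the previous one when it uses
    a name that the previous (possibly already merged) cell defines."""
    if not cells:
        return []
    merged = [cells[0]]
    for cell in cells[1:]:
        prev_defs, _ = defs_and_uses(merged[-1])
        _, uses = defs_and_uses(cell)
        if prev_defs & uses:
            merged[-1] = merged[-1] + "\n" + cell
        else:
            merged.append(cell)
    return merged


def split_cell(source):
    """Assumed rule: start a new cell at a top-level statement that uses
    no name defined so far in the current group of statements."""
    groups, group_defs = [], set()
    for stmt in ast.parse(source).body:
        text = ast.get_source_segment(source, stmt)
        defs, uses = defs_and_uses(text)
        if groups and group_defs & uses:
            groups[-1].append(text)
            group_defs |= defs
        else:
            groups.append([text])
            group_defs = set(defs)
    return ["\n".join(group) for group in groups]


cells = ["import math", "r = 2", "area = math.pi * r ** 2", "print(area)"]
merged_cells = merge_adjacent(cells)
split_cells = split_cell("a = 1\nb = a + 1\nc = 5\nd = c * 2")
```

In this sketch, the last three example cells are merged into one because each uses a name defined just before it, while the four-statement cell is split into two independent chains (`a`, `b` and `c`, `d`). The sketch only looks at top-level statements and names; comments between statements are dropped by `split_cell`, and a real implementation would need to handle those and other notebook-specific details such as magics.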