JunoBench: A Benchmark Dataset of Crashes in Python Machine Learning Jupyter Notebooks
For researchers developing debugging tools for ML code in Jupyter notebooks, JunoBench addresses the lack of benchmarks by providing a curated, reproducible dataset of crashes.
JunoBench is the first benchmark dataset of real-world crashes from Python ML Jupyter notebooks. It contains 111 crashes with verified fixes and annotations, and supports research on debugging tools for notebook-based ML development.
Jupyter notebooks are widely used for machine learning (ML) prototyping. Yet few debugging tools are designed for ML code in notebooks, partly due to the lack of benchmarks. We introduce JunoBench, the first benchmark dataset of real-world crashes in Python-based ML notebooks. JunoBench includes 111 curated and reproducible crashes with verified fixes from public Kaggle notebooks, covering popular ML libraries (e.g., TensorFlow/Keras, PyTorch, Scikit-learn) and notebook-specific out-of-order execution errors. JunoBench ensures reproducibility and ease of use through a unified environment that reliably reproduces all crashes. By providing realistic crashes, their resolutions, richly annotated labels of crash characteristics, and natural-language diagnostic annotations, JunoBench facilitates research on bug detection, localization, diagnosis, and repair in notebook-based ML development.
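To make the notion of a notebook-specific out-of-order execution error concrete, the following is a minimal sketch and is not taken from JunoBench. It assumes a hypothetical three-cell notebook, models each cell as a string, and uses a plain dictionary as a stand-in for the kernel's shared namespace. The names `cells` and `run_cells` are illustrative only.

```python
# Hypothetical three-cell notebook; each string stands in for one cell.
cells = {
    1: "values = [3, 1, 2]",
    2: "ordered = sorted(values)",
    3: "smallest = ordered[0]",
}


def run_cells(order):
    """Execute cells in the given order in a fresh shared namespace.

    Returns (namespace, crash), where crash is None on success or a
    (cell_number, exception_type_name) tuple for the first failing cell.
    """
    namespace = {}  # assumption: a dict approximates the kernel state
    for number in order:
        try:
            exec(cells[number], namespace)
        except Exception as exc:
            return namespace, (number, type(exc).__name__)
    return namespace, None


# Top-to-bottom execution succeeds.
_, crash_in_order = run_cells([1, 2, 3])

# Running cell 3 before cell 2 crashes: `ordered` is not yet defined.
_, crash_out_of_order = run_cells([1, 3, 2])

print(crash_in_order)
print(crash_out_of_order)
```

The same cells succeed or crash depending only on execution order, which is why such crashes cannot be reproduced from the notebook source alone without the order in which the cells were run.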