SE, Apr 29

Will It Break in Production? Metric-Driven Prediction of Residual Defects in Python Systems

arXiv:2604.26667
AI Analysis

This work targets the prediction of residual (post-release) defects in Python systems, a practical concern for software engineers. The approach is incremental: it applies existing supervised models to a new dataset.

The study investigates whether post-release defects in Python systems can be predicted with machine learning and deep learning. Supervised metric-based models (RandomForest, XGBoost, CatBoost) achieved a recall of 0.85-0.9 and reduced false negatives by an order of magnitude, while LLMs and unsupervised models failed to distinguish residual from non-residual faults.

Python's dynamic nature complicates testing and increases the likelihood that some defects evade detection, which makes effective fault prediction essential. We examine whether post-release faults can be predicted using modern ML and DL. Using a balanced dataset of over 4,000 labeled faults with 83 product, process, statistical, and Python-specific metrics plus normalized code representations, we conduct cross-project experiments. LLMs and unsupervised models fail to distinguish residual from non-residual faults, while supervised metric-based models (RandomForest, XGBoost, CatBoost) perform far better, yielding a recall of 0.85-0.9 and cutting false negatives by an order of magnitude. Process metrics, especially age, churn, and developer activity, alongside class and file size, consistently prove most predictive. Notably, Principal Component Analysis shows that metrics and code embeddings occupy distinct regions of the representation space, suggesting that they capture complementary rather than redundant information.
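To make the evaluation idea in the abstract concrete, here is a minimal sketch of a cross-project protocol: each project is held out in turn, a model is fit on the remaining projects, and recall and false negatives are counted on the held-out one. Everything in it is an assumption for illustration. The records, the project names, and the single `churn` feature are invented, and the threshold rule is only a stand-in for the RandomForest, XGBoost, and CatBoost models the paper reports. The paper's exact splitting procedure is not described here and may differ.

```python
# Hypothetical records: (project, churn, is_residual). The values are invented
# for illustration and are not taken from the paper's dataset.
FAULTS = [
    ("proj_a", 120, 1), ("proj_a", 15, 0), ("proj_a", 90, 1), ("proj_a", 30, 0),
    ("proj_b", 200, 1), ("proj_b", 10, 0), ("proj_b", 45, 0), ("proj_b", 150, 1),
    ("proj_c", 80, 1), ("proj_c", 20, 0), ("proj_c", 110, 1), ("proj_c", 25, 0),
]


def leave_one_project_out(records):
    """Yield (held_out_project, train, test); no project appears in both splits."""
    projects = sorted({r[0] for r in records})
    for held_out in projects:
        train = [r for r in records if r[0] != held_out]
        test = [r for r in records if r[0] == held_out]
        yield held_out, train, test


def fit_threshold(train):
    """Stand-in model (assumption): midpoint between the per-class mean churn."""
    pos = [churn for _, churn, label in train if label == 1]
    neg = [churn for _, churn, label in train if label == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def recall_and_false_negatives(test, threshold):
    """Recall = TP / (TP + FN), computed over residual faults only."""
    tp = sum(1 for _, churn, label in test if label == 1 and churn > threshold)
    fn = sum(1 for _, churn, label in test if label == 1 and churn <= threshold)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return recall, fn


def evaluate(records):
    """Map each held-out project to its (recall, false_negatives) pair."""
    results = {}
    for project, train, test in leave_one_project_out(records):
        threshold = fit_threshold(train)
        results[project] = recall_and_false_negatives(test, threshold)
    return results


if __name__ == "__main__":
    for project, (recall, fn) in evaluate(FAULTS).items():
        print(f"{project}: recall={recall:.2f}, false_negatives={fn}")
```

Recall is the metric of interest because a false negative here is a residual fault the model misses, which is the costly case for post-release defects. Splitting by project rather than by row keeps project-specific patterns from leaking between training and test data.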
