LG · Jun 1
RobustModelMaker: Coupling Bootstrap Stability Selection with Leakage-Safe Nested Cross-Validation for Scientific Machine Learning
Amanda S Barnard
Small-to-medium scientific datasets place machine learning pipelines under two compounding pressures. Single-run feature selection produces feature sets that change substantially under small perturbations of the training data, and any procedure that uses the same data for selection, tuning, and evaluation produces optimistically biased performance estimates. The two failure modes are routinely treated as separable, but in the regimes where scientific data live, they interact: an unstable selection inflates the variance of an already-optimistic score, and standard remedies for one rarely address the other. RobustModelMaker is a Python framework that couples bootstrap stability selection with strict nested cross-validation, performs all preprocessing and selection inside each fold, and produces a stability-tested feature subset together with a leakage-safe performance estimate. The framework supports nine algorithms across binary classification, multiclass classification, and regression. Behaviour is verified by a deterministic test suite spanning unit, performance, and reproducibility checks on three real scientific datasets, with comparisons against three alternative selectors (ANOVA F-test, recursive feature elimination with cross-validation, and Boruta) on both predictive score and a Jaccard measure of selection stability. RobustModelMaker is competitive in score with the best alternative selector on each dataset, and occupies a position on the joint score-stability frontier that none of the alternatives matches across all three task types. Two example applications, ovarian cancer biomarker discovery from the PLCO Trial and critical-temperature regression on the UCI Superconductivity Data, illustrate how the framework is used in practice and what trade-offs become visible when stability is treated as a first-class deliverable rather than an emergent property.
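The idea of bootstrap stability selection scored by a Jaccard measure can be illustrated with a minimal sketch. This is not the RobustModelMaker API: the function names `select_top_k` and `bootstrap_stability`, the correlation-based selector, the synthetic data, and the 0.8 frequency threshold are all assumptions chosen for illustration.

```python
import numpy as np

def select_top_k(X, y, k):
    """Hypothetical selector: the k features with the largest |Pearson r| to y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    corr = np.abs(Xc.T @ yc) / np.where(denom == 0, 1.0, denom)
    return set(np.argsort(corr)[-k:].tolist())

def bootstrap_stability(X, y, k=3, n_boot=50, seed=0):
    """Repeat selection on bootstrap resamples; return per-feature selection
    frequency and the mean pairwise Jaccard similarity of the selected sets."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    subsets = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        subsets.append(select_top_k(X[idx], y[idx], k))
    freq = np.zeros(X.shape[1])
    for s in subsets:
        freq[list(s)] += 1
    freq /= n_boot
    jaccards = [len(a & b) / len(a | b)
                for i, a in enumerate(subsets) for b in subsets[i + 1:]]
    return freq, float(np.mean(jaccards))

# Synthetic regression data (assumed): only features 0 and 1 carry signal.
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + data_rng.normal(scale=0.5, size=200)

freq, jaccard = bootstrap_stability(X, y, k=2)
stable = np.flatnonzero(freq >= 0.8)  # keep features selected in >= 80% of resamples
print("selection frequency:", np.round(freq, 2))
print("stable features:", stable, "mean Jaccard:", round(jaccard, 3))
```

In a leakage-safe setting, a routine like this would be called only on the training portion of each outer cross-validation fold, so that the held-out fold never influences which features are kept.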
LG · May 12
OverNaN: NaN-Aware Oversampling for Imbalanced Learning with Meaningful Missingness
Amanda S Barnard
Missing values are routinely treated as defects to be eliminated through deletion or imputation prior to machine learning. In many applied domains, however, missingness itself carries information, reflecting experimental constraints, measurement choices, or systematic mechanisms tied to the data-generating process. Eliminating or masking this structure can distort class boundaries, introduce bias, and reduce generalisability, particularly in imbalanced datasets where minority classes are already under-represented. OverNaN is a lightweight, NaN-aware oversampling framework designed to address class imbalance without erasing missingness structure. It extends common synthetic oversampling methods to operate directly on incomplete feature vectors, allowing missing values to be preserved, propagated, or selectively interpolated according to explicitly defined strategies. Rather than repairing missing data, OverNaN treats missingness as part of the feature space over which synthetic samples are generated. This paper situates OverNaN within the broader landscape of imbalanced learning, missing-data handling, and NaN-tolerant algorithms. Using representative examples included with the software, we demonstrate that meaningful missingness can be retained during oversampling without introducing artificial certainty. OverNaN is intended for practitioners working with small, incomplete, and imbalanced datasets in scientific and engineering domains where missingness is unavoidable and often informative.
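A minimal sketch of what NaN-aware synthetic oversampling could look like is given below. It is not the OverNaN API: the function names, the two strategy labels (`"preserve"` and `"fill"`), and the use of random minority pairs instead of the nearest-neighbour search used by SMOTE-style methods are all assumptions made to keep the example short.

```python
import numpy as np

def nan_aware_interpolate(a, b, lam, strategy="preserve"):
    """Synthesise one sample on the segment between parents a and b.

    strategy="preserve": a feature is NaN if it is missing in either parent.
    strategy="fill":     if exactly one parent is missing a feature, copy the
                         observed value; NaN only when both parents are missing.
    (Both strategy names are hypothetical.)
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    synth = a + lam * (b - a)  # NaN propagates wherever either parent is NaN
    if strategy == "fill":
        only_a = ~np.isnan(a) & np.isnan(b)
        only_b = np.isnan(a) & ~np.isnan(b)
        synth[only_a] = a[only_a]
        synth[only_b] = b[only_b]
    elif strategy != "preserve":
        raise ValueError(f"unknown strategy: {strategy!r}")
    return synth

def oversample_minority(X_min, n_new, strategy="preserve", seed=0):
    """Generate n_new synthetic rows from random pairs of minority samples."""
    rng = np.random.default_rng(seed)
    rows = []
    for _ in range(n_new):
        i, j = rng.choice(len(X_min), size=2, replace=False)
        rows.append(nan_aware_interpolate(X_min[i], X_min[j], rng.random(), strategy))
    return np.vstack(rows)

# Two incomplete parents (assumed data) and their midpoint under each strategy.
a = [1.0, np.nan, 3.0]
b = [3.0, 5.0, np.nan]
preserved = nan_aware_interpolate(a, b, 0.5, "preserve")  # [2.0, nan, nan]
filled = nan_aware_interpolate(a, b, 0.5, "fill")         # [2.0, 5.0, 3.0]

X_min = np.array([a, b, [2.0, 4.0, np.nan]])
X_new = oversample_minority(X_min, n_new=4)
```

Under the "preserve" strategy no synthetic value is invented for a feature that a parent did not observe, which is one way to avoid the artificial certainty the abstract warns against.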
LG · Jun 29, 2025
BenchMake: Turn any scientific data set into a reproducible benchmark
Amanda S Barnard
Benchmark data sets are a cornerstone of machine learning development and applications, ensuring new methods are robust, reliable and competitive. The relative rarity of benchmark sets in computational science, due to the uniqueness of the problems and the pace of change in the associated domains, makes evaluating innovations difficult for computational scientists. This paper develops and tests a tool that can potentially turn any of the growing number of openly available scientific data sets into a benchmark accessible to the community. BenchMake uses non-negative matrix factorisation to deterministically identify and isolate challenging edge cases on the convex hull (the smallest convex set that contains all existing data instances) and partitions a required fraction of matched data instances into a testing set that maximises divergence and statistical significance, across tabular, graph, image, signal and textual modalities. BenchMake splits are compared to established splits and random splits using ten publicly available benchmark sets from different areas of science, with different sizes, shapes and distributions.