LG ML · Dec 16, 2024

Stabilizing Machine Learning for Reproducible and Explainable Results: A Novel Validation Approach to Subject-Specific Insights

Gideon Vos, Liza van Eijk, Zoltan Sarnyai, Mostafa Rahimi Azghadi

arXiv:2412.16199v1 · 2.63 citations · h-index: 32 · Has Code · Comput. Methods Programs Biomed.

Originality: Incremental advance

AI Analysis

This work addresses the problem of high costs and variability in subject-specific models for clinical researchers, offering a practical and explainable alternative, though it is incremental as it builds on existing validation methods.

The paper addresses the difficulty of developing subject-specific machine learning models for medical research. It proposes a validation approach that uses a single general Random Forest model to obtain reproducible performance and robust feature importance analysis at both the group and subject-specific levels, and it reports consistent identification of key features and improved accuracy across nine diverse datasets.

Machine learning (ML) is transforming medical research by improving diagnostic accuracy and personalizing treatments. General ML models trained on large datasets identify broad patterns across populations, but their effectiveness is often limited by the diversity of human biology. This has led to interest in subject-specific models that use individual data for more precise predictions. However, these models are costly and challenging to develop. To address this, we propose a novel validation approach that uses a general ML model to ensure reproducible performance and robust feature importance analysis at both group and subject-specific levels. We tested a single Random Forest (RF) model on nine datasets varying in domain, sample size, and demographics. Different validation techniques were applied to evaluate accuracy and feature importance consistency. To introduce variability, we performed up to 400 trials per subject, randomly seeding the ML algorithm for each trial. This generated up to 400 feature sets per subject, from which we identified the top subject-specific features. A group-specific feature importance set was then derived from all subject-specific results. We compared our approach to conventional validation methods in terms of performance and feature importance consistency. Our repeated-trials approach, with random seed variation, consistently identified key features at the subject level and improved group-level feature importance analysis using a single general model. Subject-specific models address biological variability but are resource-intensive. Our novel validation technique provides consistent feature importance and improved accuracy within a general ML model, offering a practical and explainable alternative for clinical research.
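The repeated-trials idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_importances` is a hypothetical stand-in for one seeded model fit (in practice this might wrap a Random Forest trained with a given seed and return its feature importances), and the frequency-count aggregation in `stable_top_features` and `group_top_features` is an assumed scheme, since the abstract does not say how the subject-level and group-level feature sets are derived.

```python
import random
from collections import Counter


def toy_importances(seed, n_features=6):
    """Hypothetical stand-in for one seeded model fit.

    Features 0 and 1 carry signal; the rest are noise. The seed only
    perturbs the scores, mimicking run-to-run variation.
    """
    rng = random.Random(seed)
    signal = [0.5, 0.3] + [0.0] * (n_features - 2)
    return [s + rng.uniform(0.0, 0.1) for s in signal]


def stable_top_features(importance_fn, n_trials=400, top_k=2):
    """Count how often each feature ranks in the top_k across seeded trials."""
    counts = Counter()
    for seed in range(n_trials):
        scores = importance_fn(seed)
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        counts.update(ranked[:top_k])
    # Keep the features that most often appear among the top_k.
    top = [feature for feature, _ in counts.most_common(top_k)]
    return top, counts


def group_top_features(per_subject_top, top_k=2):
    """Assumed group-level step: pool the subject-level top features."""
    pooled = Counter(f for top in per_subject_top for f in top)
    return [feature for feature, _ in pooled.most_common(top_k)]


# One "subject": 400 seeded trials, then the most stable features.
top, counts = stable_top_features(toy_importances)

# Three hypothetical subjects sharing the same toy importance function.
group_top = group_top_features([top, top, top])
```

Counting top-k membership rather than averaging raw importance scores is one simple way to make the result insensitive to the scale of any single trial; other aggregation rules would fit the same loop.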
