LG AI ML Jun 14, 2024

Provably Robust Pre-Trained Ensembles for Biomarker-Based Cancer Classification

arXiv:2406.10087v22.6

Originality: Incremental advance

AI Analysis

This work addresses early cancer detection from liquid-biopsy biomarker data, offering an incremental improvement in accuracy and robustness while requiring fewer features.

The paper tackles robust cancer classification from high-dimensional biomarker data, reporting an AUC of 0.9929 on binary tasks and an accuracy of 0.9464 on multi-class tasks with only 500 features, while demonstrating robustness under class imbalance.

Certain cancer types, notably pancreatic cancer, are difficult to detect at an early stage, motivating robust biomarker-based screening. Liquid biopsies enable non-invasive monitoring of circulating biomarkers, but typical machine learning pipelines for high-dimensional tabular data (e.g., random forests, SVMs) rely on expensive hyperparameter tuning and can be brittle under class imbalance. We leverage a meta-trained Hyperfast model for cancer classification, achieving the highest AUC of 0.9929 and greater robustness than other ML algorithms, especially on highly imbalanced datasets, across several binary classification tasks (e.g., breast invasive carcinoma; BRCA vs. non-BRCA). We also propose a novel ensemble model combining a pre-trained Hyperfast model, XGBoost, and LightGBM for multi-class classification tasks, achieving an incremental increase in accuracy (0.9464) while using only 500 PCA features, in contrast to previous studies that used more than 2,000 features for similar results. Crucially, we demonstrate robustness under class imbalance: empirically, via balanced accuracy and minority-class recall across cancer-vs.-noncancer and cancer-vs.-rest settings, and theoretically, by showing (i) a prototype-form final layer for Hyperfast that yields prior-insensitive decisions under bounded bias, and (ii) minority-error reductions for majority vote under mild error diversity. Together, these results indicate that pre-trained tabular models and simple ensembling can deliver state-of-the-art accuracy and improved minority-class performance with far fewer features and no additional tuning.
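
The majority-vote and imbalance metrics mentioned in the abstract can be illustrated with a minimal sketch. It does not reproduce the paper's pipeline: the three prediction lists are hypothetical stand-ins for the outputs of Hyperfast, XGBoost, and LightGBM, the toy labels are invented for illustration, and the tie-breaking rule (fall back to the first model's vote) is an assumption rather than something the abstract specifies.

```python
from collections import Counter


def majority_vote(model_predictions):
    """Combine per-model label lists by plurality vote, sample by sample.

    Assumption: ties go to the earliest-listed model whose vote is among
    the most frequent labels for that sample.
    """
    voted = []
    for votes in zip(*model_predictions):
        counts = Counter(votes)
        top = max(counts.values())
        voted.append(next(v for v in votes if counts[v] == top))
    return voted


def per_class_recall(y_true, y_pred):
    """Recall for each class that appears in y_true."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recalls."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


# Toy imbalanced problem: 8 majority samples (0) and 2 minority samples (1).
y_true = [0] * 8 + [1] * 2

# Hypothetical predictions; each model errs on a different sample.
model_a = [0] * 8 + [0, 1]        # misses the first minority sample
model_b = [0] * 8 + [1, 0]        # misses the second minority sample
model_c = [1] + [0] * 7 + [1, 1]  # one false positive on the majority class

ensemble = majority_vote([model_a, model_b, model_c])

print("model_a minority recall:", per_class_recall(y_true, model_a)[1])
print("ensemble minority recall:", per_class_recall(y_true, ensemble)[1])
print("model_a balanced accuracy:", balanced_accuracy(y_true, model_a))
print("ensemble balanced accuracy:", balanced_accuracy(y_true, ensemble))
```

Because the three toy models make their mistakes on different samples, each error is outvoted and the ensemble recovers both minority samples. This is the intuition behind "error diversity"; the paper's formal conditions are not given in the abstract and may differ from this idealized case.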
