ME | ML | Feb 17, 2021

Data-Driven Logistic Regression Ensembles With Applications in Genomics

arXiv:2102.08591v7
1 citation
Originality: Incremental advance
AI Analysis

This work addresses the need for effective statistical methods for studying the genetic basis of disease and offers practical tools for researchers. It appears incremental, however, because it builds on existing regularization and ensembling techniques.

The paper tackles high-dimensional binary classification in genomics with an approach that integrates regularization with ensembling, aiming to improve both prediction accuracy and biomarker identification. It reports strong predictive performance in simulations and on cancer genomics datasets.

Advances in data-collection technologies in genomics have significantly increased the need for tools designed to study the genetic basis of many diseases. Effective statistical methods should excel in both prediction accuracy and biomarker identification. We introduce a novel approach to high-dimensional binary classification that integrates regularization with ensembling techniques. The method constructs compact ensembles of interpretable models derived by optimizing a global objective function. In medical genomics applications, the proposed approach identifies critical biomarkers overlooked by competing methods. We develop a variable importance ranking system to help researchers prioritize promising genes. The method's asymptotic properties are established, and an efficient computational algorithm is provided. Through extensive simulations across complex scenarios and analysis of cancer genomics datasets, we demonstrate strong predictive performance. Based on the numerical experiments, we offer practical guidelines for determining optimal ensemble size.
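
The general idea of combining regularization with ensembling can be illustrated with a minimal sketch. This is not the paper's algorithm: the abstract describes models derived by optimizing a global objective function, whereas the sketch below simply fits separate L1-penalized logistic models on disjoint random feature subsets and averages their predicted probabilities. All function names, the penalty level `lam`, the step size `lr`, and the importance score (mean absolute coefficient across models) are assumptions chosen for illustration, and features are assumed to be roughly standardized.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_l1_logistic(X, y, lam=0.05, lr=0.1, n_iter=500):
    """L1-penalized logistic regression via proximal gradient descent.

    Illustrative solver only; lam and lr are assumed values.
    """
    n, p = X.shape
    beta = np.zeros(p)
    intercept = 0.0
    for _ in range(n_iter):
        resid = sigmoid(intercept + X @ beta) - y
        # Gradient step on the average logistic loss.
        intercept -= lr * resid.mean()
        beta -= lr * (X.T @ resid) / n
        # Soft-thresholding step for the L1 penalty (intercept unpenalized).
        beta = np.sign(beta) * np.maximum(np.abs(beta) - lr * lam, 0.0)
    return intercept, beta


def fit_ensemble(X, y, n_models=3, lam=0.05, seed=0):
    """Fit one sparse model per disjoint random subset of the features."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    groups = np.array_split(rng.permutation(p), n_models)
    models = []
    for idx in groups:
        b0, b = fit_l1_logistic(X[:, idx], y, lam=lam)
        full = np.zeros(p)
        full[idx] = b  # embed the subset coefficients in a length-p vector
        models.append((b0, full))
    return models


def predict_proba(models, X):
    """Average the predicted probabilities of the ensemble members."""
    return np.mean([sigmoid(b0 + X @ b) for b0, b in models], axis=0)


def variable_importance(models):
    """Hypothetical importance score: mean absolute coefficient per feature."""
    return np.mean([np.abs(b) for _, b in models], axis=0)
```

Because each feature is assigned to exactly one model here, every member stays small and easy to inspect, which loosely mirrors the "compact ensembles of interpretable models" described in the abstract. How the paper actually allocates variables across models is not stated in this summary.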
