Greedy Biomarker Discovery in the Genome with Applications to Antimicrobial Resistance
This work addresses biomarker discovery in genomics for predicting antimicrobial resistance in human pathogens. Its contribution is an incremental improvement in handling high-dimensional data.
The authors tackled the problem of predicting antimicrobial resistance from genomic data with extremely high feature counts (over 10^7) by extending the Set Covering Machine (SCM) algorithm. Their results show that SCM achieved favorable sparsity and accuracy compared to L1/L2 regularized SVMs and CART decision trees, and it was the only method that could handle the full feature space without preprocessing.
The Set Covering Machine (SCM) is a greedy learning algorithm that produces sparse classifiers. We extend the SCM to datasets that contain a huge number of features. The whole genetic material of living organisms is an example of such a case, where the number of features exceeds 10^7. Three human pathogens were used to evaluate the performance of the SCM at predicting antimicrobial resistance. Our results show that the SCM compares favorably in terms of sparsity and accuracy against L1 and L2 regularized Support Vector Machines and CART decision trees. Moreover, the SCM was the only algorithm that could consider the full feature space. For all other algorithms, the feature space had to be filtered as a preprocessing step.