LG CE, Mar 8, 2014

Combination of PCA with SMOTE Resampling to Boost the Prediction Rate in Lung Cancer Dataset

Mehdi Naseriparsa, Mohammad Mansour Riahi Kashani

arXiv:1403.1949v1, 43 citations

Originality: Synthesis-oriented

AI Analysis

This work addresses classification challenges in medical datasets like lung cancer, but it is incremental as it combines existing methods without introducing new paradigms.

The paper tackles the problem of building reliable classification models on large, imbalanced datasets with irrelevant features. It combines PCA for dimensionality reduction with SMOTE for resampling and applies both to a lung cancer dataset, reporting improved overall accuracy, false positive rate, precision, and recall.

Classification algorithms are unable to build reliable models on very large datasets. Such datasets contain many irrelevant and redundant features that mislead classifiers. Furthermore, many large datasets have an imbalanced class distribution, which biases the classification process toward the majority class. This paper proposes combining unsupervised dimensionality reduction methods with resampling and tests the approach on the Lung-Cancer dataset. In the first step, PCA is applied to the Lung-Cancer dataset to compact it and eliminate irrelevant features; in the second step, SMOTE resampling is carried out to balance the class distribution and increase the variety of the sample domain. Finally, a Naive Bayes classifier is applied to the resulting dataset, the results are compared, and evaluation metrics are calculated. The experiments show the effectiveness of the proposed method across four evaluation metrics: overall accuracy, false positive rate, precision, and recall.
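The resampling step and the four metrics named in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the function names `smote` and `metrics`, the toy data, and the parameter defaults are assumptions made for illustration, and the PCA and Naive Bayes steps are omitted. The sketch shows the core idea of SMOTE, which is to create synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours.

```python
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points built from a list of minority samples.

    Illustrative sketch: each synthetic point lies on the line segment
    between a randomly chosen minority sample and one of its k nearest
    minority neighbours. Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


def metrics(tp, fp, tn, fn):
    """Compute the four metrics from binary confusion-matrix counts.

    Assumes every denominator is non-zero.
    """
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "false_positive_rate": fp / (fp + tn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


# Hypothetical toy minority class with two features per sample.
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_points = smote(minority, n_new=6, k=2)

# Hypothetical confusion-matrix counts, not results from the paper.
scores = metrics(tp=8, fp=2, tn=18, fn=4)
```

Because every synthetic point is a convex combination of two real minority samples, it stays inside the region already occupied by the minority class. In the order the abstract describes, such resampling would be applied to the PCA-reduced features before the classifier is trained.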
