LG APSep 23, 2025

Enhancing Credit Default Prediction Using Boruta Feature Selection and DBSCAN Algorithm with Different Resampling Techniques

Obu-Amoah Ampomah, Edmund Agyemang, Kofi Acheampong, Louis Agyekum

arXiv:2509.19408v11 citationsh-index: 3

Originality Synthesis-oriented

AI Analysis

This work addresses the class imbalance problem in credit default prediction for financial institutions, but it is incremental as it combines existing methods without introducing new paradigms.

This study tackled credit default prediction by comparing resampling techniques and machine learning models, finding that a combination of Boruta feature selection, DBSCAN outlier detection, SMOTE-Tomek resampling, and Gradient Boosting Machine achieved the best performance with an F1-score of 82.56% and ROC-AUC of 90.90%.

This study examines credit default prediction by comparing three techniques, namely SMOTE, SMOTE-Tomek, and ADASYN, that are commonly used to address the class imbalance problem in credit default situations. Recognizing that credit default datasets are typically skewed, with defaulters comprising a much smaller proportion than non-defaulters, we began our analysis by evaluating machine learning (ML) models on the imbalanced data without any resampling to establish baseline performance. These baseline results provide a reference point for understanding the impact of subsequent balancing methods. In addition to traditional classifiers such as Naive Bayes and K-Nearest Neighbors (KNN), our study also explores the suitability of advanced ensemble boosting algorithms, including Extreme Gradient Boosting (XGBoost), AdaBoost, Gradient Boosting Machines (GBM), and Light GBM for credit default prediction using Boruta feature selection and DBSCAN-based outlier detection, both before and after resampling. A real-world credit default data set sourced from the University of Cleveland ML Repository was used to build ML classifiers, and their performances were tested. The criteria chosen to measure model performance are the area under the receiver operating characteristic curve (ROC-AUC), area under the precision-recall curve (PR-AUC), G-mean, and F1-scores. The results from this empirical study indicate that the Boruta+DBSCAN+SMOTE-Tomek+GBM classifier outperformed the other ML models (F1-score: 82.56%, G-mean: 82.98%, ROC-AUC: 90.90%, PR-AUC: 91.85%) in a credit default context. The findings establish a foundation for future progress in creating more resilient and adaptive credit default systems, which will be essential as credit-based transactions continue to rise worldwide.

View on arXiv PDF

Similar