LG · Dec 10, 2024

Impact of Sampling Techniques and Data Leakage on XGBoost Performance in Credit Card Fraud Detection

arXiv:2412.07437v1 · 5 citations · h-index: 1

Originality: Synthesis-oriented

AI Analysis

This paper addresses a data leakage issue that matters to practitioners who use machine learning for fraud detection in financial security, though the contribution is incremental because it focuses on methodological refinement.

The study tackled the problem of data leakage in credit card fraud detection by comparing XGBoost performance with sampling techniques applied before and after train-test splits, finding that pre-split sampling artificially inflated metrics while post-split sampling preserved evaluation integrity with lower results.

Credit card fraud detection remains a critical challenge in financial security, with machine learning models like XGBoost (eXtreme Gradient Boosting) emerging as powerful tools for identifying fraudulent transactions. However, the inherent class imbalance in credit card transaction datasets poses significant challenges for model performance. Although sampling techniques are commonly used to address this imbalance, their implementation sometimes precedes the train-test split, potentially introducing data leakage. This study presents a comparative analysis of XGBoost's performance in credit card fraud detection under three scenarios: first, without any imbalance handling techniques; second, with sampling techniques applied only to the training set after the train-test split; and third, with sampling techniques applied before the train-test split. We utilized a dataset from Kaggle of 284,807 credit card transactions, containing 0.172% fraudulent cases, to evaluate these approaches. Our findings show that although sampling strategies enhance model performance, the reliability of results is greatly impacted by when they are applied. Due to a data leakage issue that frequently occurs in machine learning models during the sampling phase, XGBoost models trained on data where sampling was applied prior to the train-test split may have displayed artificially inflated performance metrics. Surprisingly, models trained with sampling techniques applied solely to the training set demonstrated significantly lower results than those with pre-split sampling, while preserving the integrity of the evaluation process.
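The difference between the second and third scenarios can be illustrated with a minimal sketch. It is not the paper's code: it uses synthetic rows, plain random oversampling by duplication as a stand-in for whichever sampling techniques the study evaluated, and hypothetical helper names (`make_dataset`, `split_rows`, `random_oversample`, `leaked_ids`). No model is trained; the sketch only counts how many original rows end up on both sides of the split.

```python
import random

def make_dataset(n_rows=50_000, fraud_rate=0.00172, seed=0):
    """Synthetic (row_id, label) pairs; label 1 marks the rare fraud class.
    The row count is an assumption; the rate mirrors the 0.172% in the abstract."""
    rng = random.Random(seed)
    return [(i, 1 if rng.random() < fraud_rate else 0) for i in range(n_rows)]

def split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and cut it into train and test parts."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def random_oversample(rows, seed=0):
    """Duplicate minority rows at random until both classes have equal counts."""
    rng = random.Random(seed)
    minority = [r for r in rows if r[1] == 1]
    majority = [r for r in rows if r[1] == 0]
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return rows + extra

def leaked_ids(train, test):
    """Original row ids that appear in both the training and the test set."""
    return {r[0] for r in train} & {r[0] for r in test}

data = make_dataset()

# Post-split sampling: split first, then resample the training set only.
train, test = split_rows(data)
train_balanced = random_oversample(train)
post_split_leak = leaked_ids(train_balanced, test)

# Pre-split sampling: resample the whole dataset, then split.
resampled = random_oversample(data)
train_leaky, test_leaky = split_rows(resampled)
pre_split_leak = leaked_ids(train_leaky, test_leaky)

print("rows shared between train and test, post-split sampling:", len(post_split_leak))
print("rows shared between train and test, pre-split sampling:", len(pre_split_leak))
```

With post-split sampling the shared count is zero by construction, because duplicates are drawn only from training rows. With pre-split sampling, copies of the same fraud rows land on both sides of the split, so a model is partly evaluated on examples it has already seen. Interpolating methods such as SMOTE do not copy rows exactly, but synthetic points built from rows that later fall in the test set raise the same concern.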
