LG, Sep 15, 2024

Enhancing Data Quality through Self-learning on Imbalanced Financial Risk Data

arXiv:2409.09792v1, 1 citation, h-index: 9
Originality: Incremental advance
AI Analysis

This paper addresses data quality issues in financial risk prediction, but the advance is incremental because it builds on existing data pre-processing methods.

The study tackled the problem of poor machine learning performance in financial risk prediction due to imbalanced data by introducing TriEnhance, a data pre-processing technique that improved minority class calibration across six benchmark datasets.

In the financial risk domain, particularly in credit default prediction and fraud detection, accurate identification of high-risk class instances is paramount, as their occurrence can have significant economic implications. Although machine learning models have gained widespread adoption for risk prediction, their performance is often hindered by the scarcity and diversity of high-quality data. This limitation stems from factors in datasets such as small risk sample sizes, high labeling costs, and severe class imbalance, which impede the models' ability to learn effectively and accurately forecast critical events. This study investigates data pre-processing techniques to enhance existing financial risk datasets by introducing TriEnhance, a straightforward technique that entails: (1) generating synthetic samples specifically tailored to the minority class, (2) filtering using binary feedback to refine samples, and (3) self-learning with pseudo-labels. Our experiments across six benchmark datasets reveal the efficacy of TriEnhance, with a notable focus on improving minority class calibration, a key factor for developing more robust financial risk prediction systems.
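The three steps named in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the generator is simple interpolation between minority points, the binary feedback comes from a nearest-centroid classifier, and pseudo-labels are accepted using a hypothetical distance-ratio threshold `margin`. The function names (`tri_enhance_sketch`, `generate_minority`, and the helpers) are likewise hypothetical.

```python
import random


def centroid(rows):
    """Mean vector of a non-empty list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def predict(x, c_neg, c_pos):
    """Stand-in classifier (assumption): 1 = minority/risk, 0 = majority."""
    return 1 if sq_dist(x, c_pos) < sq_dist(x, c_neg) else 0


def generate_minority(minority, n_new, rng):
    """Step 1 (assumed generator): interpolate between random minority pairs."""
    samples = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)
        t = rng.random()
        samples.append([ai + t * (bi - ai) for ai, bi in zip(a, b)])
    return samples


def tri_enhance_sketch(majority, minority, unlabeled, n_new=20, margin=2.0, seed=0):
    """Return (kept synthetic minority samples, pseudo-labeled (x, y) pairs)."""
    rng = random.Random(seed)
    c_neg, c_pos = centroid(majority), centroid(minority)

    # Step 1: generate synthetic candidates for the minority class.
    candidates = generate_minority(minority, n_new, rng)

    # Step 2: binary feedback, keep a candidate only if the classifier
    # accepts it as a minority-class sample.
    kept = [x for x in candidates if predict(x, c_neg, c_pos) == 1]

    # Step 3: self-learning, pseudo-label an unlabeled point only when one
    # centroid is at least `margin` times closer than the other.
    pseudo = []
    for x in unlabeled:
        d_neg, d_pos = sq_dist(x, c_neg), sq_dist(x, c_pos)
        if max(d_neg, d_pos) >= margin * min(d_neg, d_pos):
            pseudo.append((x, 1 if d_pos < d_neg else 0))
    return kept, pseudo


# Toy data, invented for this example only.
majority = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
minority = [[5.0, 5.0], [6.0, 5.0], [5.0, 6.0]]
unlabeled = [[0.2, 0.3], [5.5, 5.2], [3.0, 3.0]]
kept, pseudo = tri_enhance_sketch(majority, minority, unlabeled, n_new=10)
```

In this toy setup the ambiguous point `[3.0, 3.0]` receives no pseudo-label because neither centroid is clearly closer. A real pipeline would replace each stand-in with a trained generator and a calibrated classifier.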

