RM LG SI · Dec 31, 2022

Assessment of creditworthiness models privacy-preserving training with synthetic data

Ricardo Muñoz-Cancino, Cristián Bravo, Sebastián A. Ríos, Manuel Graña

arXiv:2301.01212

Originality: Incremental advance

AI Analysis

This work addresses privacy concerns in credit risk assessment for financial institutions, enabling research with synthetic data, though it is incremental in nature.

The study tackled the problem of evaluating credit scoring models trained on synthetic data to preserve borrower privacy, finding that models trained with synthetic data showed a 3% reduction in AUC and 6% reduction in KS compared to those trained on real data.

Credit scoring models are the primary instrument used by financial institutions to manage credit risk. Research on behavioral scoring is scarce because data access is difficult: financial institutions must maintain the privacy and security of borrowers' information, which keeps them from collaborating in research initiatives. In this work, we present a methodology for evaluating the performance of models trained with synthetic data when they are applied to real-world data. Our results show that synthetic data quality deteriorates as the number of attributes increases. However, creditworthiness assessment models trained with synthetic data show a reduction of 3% in AUC and 6% in KS compared with models trained with real data. These results are significant because they encourage credit risk research based on synthetic data, making it possible to maintain borrowers' privacy and to address problems that until now have been hampered by the limited availability of information.
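The two metrics quoted in the abstract, AUC and KS, can be illustrated with a minimal sketch in plain Python. This is not the authors' code: the function names (`auc`, `ks`, `relative_drop`) and the toy labels and scores are hypothetical, and reading the reported reductions as relative drops is an assumption, since the abstract does not say whether they are relative or absolute.

```python
from typing import Sequence


def _split(labels: Sequence[int], scores: Sequence[float]):
    """Separate scores into the positive (label 1) and negative (label 0) class."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    return pos, neg


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win. Quadratic in sample size, fine for a toy example.
    """
    pos, neg = _split(labels, scores)
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def ks(labels: Sequence[int], scores: Sequence[float]) -> float:
    """KS statistic: largest gap between the two classes' empirical score CDFs."""
    pos, neg = _split(labels, scores)
    best = 0.0
    for t in sorted(set(scores)):
        cdf_pos = sum(s <= t for s in pos) / len(pos)
        cdf_neg = sum(s <= t for s in neg) / len(neg)
        best = max(best, abs(cdf_pos - cdf_neg))
    return best


def relative_drop(real_metric: float, synth_metric: float) -> float:
    """Relative reduction of a metric (assumed interpretation, see lead-in)."""
    return (real_metric - synth_metric) / real_metric


# Hypothetical real-world holdout labels and scores from two models:
# one trained on real data, one trained on synthetic data.
holdout_labels = [1, 1, 1, 0, 0, 0]
scores_real_trained = [0.9, 0.8, 0.6, 0.5, 0.3, 0.1]
scores_synth_trained = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]

auc_drop = relative_drop(auc(holdout_labels, scores_real_trained),
                         auc(holdout_labels, scores_synth_trained))
ks_drop = relative_drop(ks(holdout_labels, scores_real_trained),
                        ks(holdout_labels, scores_synth_trained))
```

Both models are scored on the same real holdout set, which mirrors the evaluation setup the abstract describes: train on synthetic data, then measure performance on real-world data.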
