LG | Jun 2, 2023

Analyzing Credit Risk Model Problems through NLP-Based Clustering and Machine Learning: Insights from Validation Reports

Szymon Lis, Mariusz Kubkowski, Olimpia Borkowska, Dobromił Serwa, Jarosław Kurpanik

arXiv:2306.01618v1 | 1 citation | h-index: 13

Originality: Synthesis-oriented

AI Analysis

The paper addresses the need for automated analysis of textual validation data in banking, though its contribution is incremental: it applies existing methods to a new domain-specific dataset.

The paper tackles the problem of identifying and classifying credit risk model issues by applying NLP-based clustering and machine learning to validation reports, achieving over 60% accuracy in clustering and 80% accuracy in prediction with XGBoost.

This paper explores the use of clustering methods and machine learning algorithms, including Natural Language Processing (NLP), to identify and classify problems in credit risk models from the textual information contained in validation reports. The study uses a unique dataset of 657 findings raised by validation teams in a large international banking group between January 2019 and December 2022. The findings are classified into nine validation dimensions and assigned a severity level by validators using their expert knowledge. The authors generate embeddings for the findings' titles and observations using four different pre-trained models, including "module_url" from TensorFlow Hub and three models from the SentenceTransformer library, namely "all-mpnet-base-v2", "all-MiniLM-L6-v2", and "paraphrase-mpnet-base-v2". The paper compares various clustering methods for grouping findings with similar characteristics, enabling the identification of common problems within each validation dimension and severity level. The results show that clustering is an effective approach for identifying and classifying credit risk model problems, with accuracy higher than 60%. The authors also employ machine learning algorithms, including logistic regression and XGBoost, to predict the validation dimension and its severity, achieving an accuracy of 80% with the XGBoost algorithm. Furthermore, the study identifies the top 10 words that predict a validation dimension and severity. Overall, the paper contributes by demonstrating the usefulness of clustering and machine learning for analyzing textual information in validation reports, and by providing insights into the types of problems encountered in the development and validation of credit risk models.
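The abstract reports a clustering accuracy above 60% but does not say how a cluster assignment is scored against the validators' labels. One common convention, assumed here rather than taken from the paper, is to map each cluster to its majority label and count the share of findings that match. The following is a minimal sketch of that scoring in plain Python; the function name `cluster_accuracy`, the toy cluster IDs, and the dimension names are all hypothetical.

```python
from collections import Counter, defaultdict


def cluster_accuracy(cluster_ids, true_labels):
    """Share of items whose label equals the majority label of their cluster.

    Assumption: this majority-vote mapping (often called purity) stands in
    for the accuracy measure mentioned in the abstract; the paper may use a
    different definition.
    """
    if len(cluster_ids) != len(true_labels):
        raise ValueError("cluster_ids and true_labels must have equal length")
    if not cluster_ids:
        raise ValueError("at least one item is required")

    # Count how often each label occurs inside each cluster.
    label_counts = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, true_labels):
        label_counts[cluster][label] += 1

    # Every item carrying its cluster's majority label counts as correct.
    correct = sum(c.most_common(1)[0][1] for c in label_counts.values())
    return correct / len(true_labels)


# Hypothetical toy data: eight findings, three clusters, three dimensions.
clusters = [0, 0, 0, 1, 1, 1, 2, 2]
dimensions = [
    "data", "data", "methodology",
    "methodology", "methodology", "data",
    "documentation", "documentation",
]
score = cluster_accuracy(clusters, dimensions)
print(f"clustering accuracy: {score:.2f}")
```

In practice the cluster IDs would come from running a clustering method on the sentence embeddings, and the labels would be the validation dimension or severity assigned by the validators. Note that this score never decreases as the number of clusters grows, so it is best compared across methods that use the same number of clusters.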

