Prediction of motor insurance claims occurrence as an imbalanced machine learning problem
This work addresses the challenge of imbalanced data for insurance companies in claim prediction, but it is incremental as it applies existing methods without introducing new techniques.
The paper tackles the problem of predicting motor insurance claim occurrence from imbalanced datasets, applying machine learning methods such as logistic regression and random forest and comparing their performance in this context.
The insurance industry, with its large datasets, is a natural setting for big data solutions. It must be stressed, however, that a significant number of machine learning applications in the insurance industry, such as fraud detection or claim prediction, involve learning from an imbalanced dataset. This is because frauds and claims are rare events relative to the entire population of drivers. The problem of imbalanced learning is often hard to overcome. Therefore, the main goal of this work is to present and apply various methods of dealing with an imbalanced dataset in the context of claim occurrence prediction in car insurance. In addition, these techniques are used to compare the results of machine learning algorithms on the same task. Our study covers the following algorithms: logistic regression, decision tree, random forest, XGBoost, and a feed-forward neural network. The task is posed as a classification problem.
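The abstract does not name the specific imbalance-handling methods used, so the following is only a minimal sketch of two common options, class weighting and random oversampling, and should not be read as the authors' procedure. The helper names `balanced_class_weights` and `random_oversample`, the 95/5 class split, and the toy rows are all assumptions made for illustration.

```python
import random
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: n / (k * count_c)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of each smaller class (with replacement)
    until every class has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, cnt in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        extra = [rng.choice(pool) for _ in range(target - cnt)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Synthetic toy portfolio (hypothetical): 95 policies without a claim (0)
# and 5 policies with a claim (1). Each row holds a single dummy feature.
labels = [0] * 95 + [1] * 5
rows = [[i] for i in range(100)]

weights = balanced_class_weights(labels)
res_rows, res_labels = random_oversample(rows, labels)

print(weights)              # per-class weights; the rare class gets the larger one
print(Counter(res_labels))  # class counts after oversampling
```

Class weights leave the data unchanged and instead make errors on the rare class cost more during training, while oversampling changes the training set itself. In either case the adjustment would normally be applied to the training split only, so that evaluation reflects the original claim frequency.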