Generative Synthesis of Insurance Datasets
This work addresses the problem of limited data access for actuarial researchers and developers, though the contribution is incremental, since it applies an existing method (CTGAN) to a new domain.
The authors tackle the lack of realistic public insurance datasets by developing a CTGAN-based workflow that synthesizes data for general insurance pricing and life insurance shock lapse modeling. They evaluate the synthesized data on machine learning efficacy, variable distributions, and parameter stability.
One of the impediments to advancing actuarial research and developing open-source assets for insurance analytics is the lack of realistic publicly available datasets. In this work, we develop a workflow for synthesizing insurance datasets leveraging CTGAN, a recently proposed neural network architecture for generating tabular data. Applying the proposed workflow to publicly available data in the domains of general insurance pricing and life insurance shock lapse modeling, we evaluate the synthesized datasets from three perspectives: machine learning efficacy, distributions of variables, and stability of model parameters. This workflow is implemented via an R interface to promote adoption by researchers and data owners.
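The machine learning efficacy and parameter stability checks mentioned above can be sketched roughly as follows. This is a minimal illustration in Python with NumPy, not the authors' R workflow. The function `toy_synthesize` is a hypothetical stand-in for a trained CTGAN generator, and the simulated data, the linear model, and the RMSE metric are all assumptions chosen only to keep the example self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_data(n):
    """Simulate a toy 'real' dataset (an assumption, not insurance data)."""
    X = rng.normal(size=(n, 2))
    y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=n)
    return X, y


def toy_synthesize(X, y, n):
    """Hypothetical stand-in for a generator: resample rows and add noise.

    This is NOT CTGAN; it only produces a second dataset to compare against.
    """
    idx = rng.integers(0, len(X), size=n)
    X_syn = X[idx] + rng.normal(scale=0.05, size=(n, X.shape[1]))
    y_syn = y[idx] + rng.normal(scale=0.05, size=n)
    return X_syn, y_syn


def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns the coefficients."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def rmse(coef, X, y):
    """Root mean squared error of a fitted linear model on (X, y)."""
    A = np.column_stack([np.ones(len(X)), X])
    return float(np.sqrt(np.mean((A @ coef - y) ** 2)))


def ml_efficacy(real_train, synthetic, real_test):
    """Train on real and on synthetic data, then score both on real test data."""
    coef_real = fit_linear(*real_train)
    coef_syn = fit_linear(*synthetic)
    return {
        "rmse_real": rmse(coef_real, *real_test),
        "rmse_synthetic": rmse(coef_syn, *real_test),
        # Parameter stability: largest coefficient gap between the two fits.
        "max_coef_gap": float(np.max(np.abs(coef_real - coef_syn))),
    }


real_train = make_data(500)
real_test = make_data(200)
synthetic = toy_synthesize(*real_train, n=500)

scores = ml_efficacy(real_train, synthetic, real_test)
print(scores)
```

If the two error values are close and the coefficient gap is small, the synthetic data preserves the relationships the model relies on. In this sketch the gap is small by construction, because the stand-in generator only perturbs real rows.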