DB LG

Dec 23, 2023

IRG: Generating Synthetic Relational Databases using Deep Learning with Insightful Relational Understanding

Jiayu Li, Zilong Zhao, Vikram Chundawat, Biplab Sikdar, Y. C. Tay

arXiv:2312.15187v21.21 citations

Has Code

Originality: Highly original

AI Analysis

IRG addresses the need for scalable, accurate synthetic data generation in corporate and institutional settings, for applications such as privacy-preserving data sharing and software testing. It combines features that previous work had not achieved together.

The paper tackles the problem of generating synthetic relational databases with complex structures, proposing the incremental relational generator (IRG) that preserves schema integrity and improves data fidelity and utility, as demonstrated on three real-life datasets.

Synthetic data has numerous applications, including software testing at scale, privacy-preserving data sharing that enables smoother collaboration between stakeholders, and data augmentation for analytical and machine learning tasks. Relational databases, which are commonly used by corporations, governments, and financial institutions, present unique challenges for synthetic data generation because of their complex structures. Existing approaches to synthetic relational database generation often assume idealized scenarios, such as every table having a single perfect primary key column with no composite or overlapping primary and foreign key constraints, and they fail to account for the sequential nature of certain tables. In this paper, we propose the incremental relational generator (IRG), which handles these ubiquitous real-life situations. IRG preserves relational schema integrity, offers a deep contextual understanding of relationships beyond direct ancestors and descendants, leverages newly designed deep neural networks, and scales efficiently to larger datasets, a combination not achieved in previous work. Experiments on three open-source, real-life relational datasets from different fields and at different scales demonstrate IRG's advantage in maintaining the synthetic data's relational schema validity, fidelity, and utility.
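To make "relational schema validity" with composite and overlapping keys concrete, here is a minimal sketch of how one might check a generated database for key violations. It is not IRG's method or evaluation code. The `offerings` and `enrollments` tables, their columns, and both helper functions are hypothetical, and the sketch assumes that a foreign key with any null component is left unchecked.

```python
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, object]


def key_of(row: Row, columns: Sequence[str]) -> Tuple[object, ...]:
    """Project a row onto the given key columns."""
    return tuple(row[c] for c in columns)


def primary_key_violations(rows: List[Row], pk_columns: Sequence[str]) -> List[tuple]:
    """Return key tuples that repeat an earlier key or contain a null."""
    seen = set()
    bad = []
    for row in rows:
        key = key_of(row, pk_columns)
        if None in key or key in seen:
            bad.append(key)
        seen.add(key)
    return bad


def foreign_key_violations(
    child_rows: List[Row],
    fk_columns: Sequence[str],
    parent_rows: List[Row],
    parent_columns: Sequence[str],
) -> List[tuple]:
    """Return child key tuples with no matching parent row."""
    parent_keys = {key_of(r, parent_columns) for r in parent_rows}
    bad = []
    for row in child_rows:
        key = key_of(row, fk_columns)
        if None in key:
            continue  # assumption: partially null foreign keys are not checked
        if key not in parent_keys:
            bad.append(key)
    return bad


# Hypothetical synthetic output. The primary key of `enrollments` is
# (student_id, course_id, term); it overlaps with the composite foreign key
# (course_id, term) that references `offerings`.
offerings = [
    {"course_id": "CS101", "term": "2023F"},
    {"course_id": "CS101", "term": "2024S"},
]
enrollments = [
    {"student_id": 1, "course_id": "CS101", "term": "2023F"},
    {"student_id": 2, "course_id": "CS101", "term": "2024S"},
    {"student_id": 3, "course_id": "CS101", "term": "2022F"},  # no such offering
]

dangling = foreign_key_violations(
    enrollments, ["course_id", "term"], offerings, ["course_id", "term"]
)
duplicates = primary_key_violations(enrollments, ["student_id", "course_id", "term"])
print(dangling)    # [('CS101', '2022F')]
print(duplicates)  # []
```

A generator that samples each key column independently can easily produce rows like the third enrollment, which is why composite keys have to be treated as a unit.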
