LG, Feb 3, 2023

VT-GAN: Cooperative Tabular Data Synthesis using Vertical Federated Learning

Zilong Zhao, Han Wu, Aad Van Moorsel, Lydia Y. Chen

arXiv:2302.01706v25.36 citations, h-index: 34

Originality: Incremental advance

AI Analysis

VT-GAN addresses privacy concerns for entities such as financial institutions that hold disjoint features of tabular data about the same customers, enabling collaborative data synthesis without sharing raw data.

The paper tackles privacy-preserving generation of synthetic tabular data by applying Vertical Federated Learning (VFL) to GANs, introducing the VT-GAN framework. It reports that VT-GAN performs close to a centralized GAN, with a difference in machine learning utility as low as 2.7%, even when data are extremely imbalanced across clients or the number of clients varies.

This paper presents the application of Vertical Federated Learning (VFL) to generate synthetic tabular data using Generative Adversarial Networks (GANs). VFL is a collaborative approach to train machine learning models among distinct tabular data holders, such as financial institutions, who possess disjoint features for the same group of customers. In this paper we introduce the VT-GAN framework, Vertical federated Tabular GAN, and demonstrate that VFL can be successfully used to implement GANs for distributed tabular data in a privacy-preserving manner, with performance close to centralized GANs that assume shared data. We make design choices with respect to the distribution of GAN generator and discriminator models and introduce a training-with-shuffling technique so that no party can reconstruct training data from the GAN conditional vector. The paper presents (1) an implementation of VT-GAN, (2) a detailed quality evaluation of the VT-GAN-generated synthetic data, (3) an overall scalability examination of the VT-GAN framework, and (4) a security analysis of VT-GAN's robustness against Membership Inference Attacks with different settings of Differential Privacy, for a range of datasets with diverse distribution characteristics. Our results demonstrate that VT-GAN can consistently generate high-fidelity synthetic tabular data of comparable quality to that generated by a centralized GAN algorithm. The difference in machine learning utility can be as low as 2.7%, even under extremely imbalanced data distributions across clients or with different numbers of clients.
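
The vertical setting the abstract describes, in which parties hold disjoint features for the same group of customers, can be illustrated with a small sketch. This is not the VT-GAN implementation: the toy table, the column names, and the helpers `vertical_split` and `rejoin` are all hypothetical and only show how the data are laid out across parties, and what a centralized baseline "that assumes shared data" would see instead.

```python
# Hypothetical toy table: one row per customer, keyed by a shared customer ID.
records = {
    "c1": {"age": 34, "income": 52000, "balance": 1200, "defaulted": 0},
    "c2": {"age": 51, "income": 87000, "balance": 300, "defaulted": 1},
    "c3": {"age": 27, "income": 41000, "balance": 5600, "defaulted": 0},
}


def vertical_split(table, feature_groups):
    """Return one view per party: the same row IDs, but only that party's columns."""
    return [
        {rid: {name: row[name] for name in features} for rid, row in table.items()}
        for features in feature_groups
    ]


def rejoin(views):
    """Merge the party views by row ID (what a centralized setup assumes is possible)."""
    merged = {}
    for view in views:
        for rid, columns in view.items():
            merged.setdefault(rid, {}).update(columns)
    return merged


# Assumed split: party A holds demographics, party B holds account data.
party_a, party_b = vertical_split(
    records, [["age", "income"], ["balance", "defaulted"]]
)
```

In a vertical federated setup the parties never perform the `rejoin` step on raw data; it is included here only to make the contrast with the centralized baseline concrete.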
