Flow Matching for Tabular Data Synthesis
This work addresses privacy concerns in data sharing in domains such as healthcare and finance by offering a more efficient and effective method for generating synthetic data. The contribution is incremental, since it builds on existing flow matching techniques.
The paper addresses synthetic tabular data generation for privacy-preserving sharing by comparing flow matching methods with diffusion baselines. It finds that flow matching, especially TabbyFlow, outperforms the baselines while needing few function evaluations (≤100 steps), that the Optimal Transport path yields the best performance, and that the Variance Preserving path may reduce disclosure risk.
Synthetic data generation is an important tool for privacy-preserving data sharing. While diffusion models have set recent benchmarks, flow matching (FM) offers a promising alternative. This paper presents different ways to implement flow matching for tabular data synthesis. We provide a comprehensive empirical study that compares flow matching (FM and variational FM) with state-of-the-art diffusion methods (TabDDPM and TabSyn) in tabular data synthesis. We evaluate both the standard Optimal Transport (OT) and the Variance Preserving (VP) probability paths, and also compare deterministic and stochastic samplers (a comparison that becomes possible when learning to generate using \textit{variational} flow matching), characterising the empirical relationship between data utility and privacy risk. Our key findings reveal that flow matching, particularly TabbyFlow, outperforms diffusion baselines. Flow matching methods also achieve better performance with remarkably few function evaluations ($\leq$ 100 steps), offering a substantial computational advantage. The choice of probability path is also crucial: the OT path demonstrates superior performance, while the VP path has potential for producing synthetic data with lower disclosure risk. Lastly, our results show that making flows stochastic not only preserves marginal distributions but, in some instances, enables the generation of high-utility synthetic data with reduced disclosure risk.
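To make the two probability paths concrete, the sketch below writes each as an interpolation between a noise sample `x0` and a data row `x1`, and adds a fixed-step Euler sampler as one possible deterministic sampler. It is an illustration under stated assumptions rather than the implementation evaluated in the paper: the function names are hypothetical, the VP path uses a simple trigonometric schedule (alpha = sin(pi t / 2), sigma = cos(pi t / 2)) chosen only because it satisfies alpha^2 + sigma^2 = 1, and `oracle_ot_velocity` is a closed-form stand-in for a trained velocity network.

```python
import numpy as np


def ot_path(x0, x1, t):
    """Linear (OT) path: x_t = (1 - t) * x0 + t * x1.

    Returns the interpolated point and its time derivative, which is
    the regression target for the velocity network.
    """
    xt = (1.0 - t) * x0 + t * x1
    ut = x1 - x0
    return xt, ut


def vp_schedule(t):
    """Assumed VP schedule (illustrative, not taken from the paper).

    alpha(t) = sin(pi t / 2), sigma(t) = cos(pi t / 2),
    so alpha^2 + sigma^2 = 1 for every t in [0, 1].
    """
    half_pi = 0.5 * np.pi
    alpha, sigma = np.sin(half_pi * t), np.cos(half_pi * t)
    d_alpha, d_sigma = half_pi * np.cos(half_pi * t), -half_pi * np.sin(half_pi * t)
    return alpha, sigma, d_alpha, d_sigma


def vp_path(x0, x1, t):
    """VP path: x_t = alpha(t) * x1 + sigma(t) * x0, with its time derivative."""
    alpha, sigma, d_alpha, d_sigma = vp_schedule(t)
    xt = alpha * x1 + sigma * x0
    ut = d_alpha * x1 + d_sigma * x0
    return xt, ut


def euler_sample(velocity, x0, n_steps=100):
    """Deterministic sampler: integrate dx/dt = velocity(x, t) from t=0 to t=1."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * velocity(x, k * dt)
    return x


def oracle_ot_velocity(x1):
    """Hypothetical stand-in for a trained network.

    For a single target x1, the conditional OT velocity is (x1 - x) / (1 - t).
    """
    def velocity(x, t):
        return (x1 - x) / (1.0 - t)
    return velocity


rng = np.random.default_rng(0)
noise = rng.standard_normal((4, 3))   # x0: Gaussian noise, 4 rows, 3 numeric columns
target = rng.standard_normal((4, 3))  # x1: placeholder for preprocessed data rows
sample = euler_sample(oracle_ot_velocity(target), noise, n_steps=10)
```

Because the OT path moves each point along a straight line at constant velocity, the Euler sampler with the oracle velocity reaches `target` regardless of the number of steps. A learned velocity field is only approximately straight, so in practice the step count still matters.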