DB LG, Oct 31, 2024

DEREC-SIMPRO: unlock Language Model benefits to advance Synthesis in Data Clean Room

Tung Sum Thomas Kwok, Chi-hua Wang, Guang Cheng

arXiv:2411.00879v13.32 citations, h-index: 9

Originality: Incremental advance

AI Analysis

This work addresses privacy concerns in data clean rooms by improving synthetic data generation for data collaboration. It appears incremental, since it builds on existing multi-table synthesizers and adds new pre-processing and evaluation methods.

The paper tackles the failure of multi-table synthesizers when subjects repeat across tables, a common scenario in data collaboration. It introduces the DEREC pre-processing pipeline and the SIMPRO evaluation metrics. The reported results are improved synthetic data fidelity with DEREC, and multi-table synthesizers outperforming single-table ones in collaboration settings.
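To make the "repeating subjects" scenario concrete, here is a minimal sketch that checks whether a shared subject key repeats in one or both tables. It is not part of DEREC; the function names `repeated_keys` and `relationship_type`, the row format (a list of dicts), and the `subject_id` column are all hypothetical choices made for illustration.

```python
from collections import Counter

def repeated_keys(rows, key):
    """Return the set of key values that appear in more than one row."""
    counts = Counter(row[key] for row in rows)
    return {value for value, count in counts.items() if count > 1}

def relationship_type(left_rows, right_rows, key):
    """Classify how two tables relate on a shared subject key.

    "many-to-many" is the case the summary above describes:
    subjects occur repeatedly in both tables.
    """
    left_repeats = bool(repeated_keys(left_rows, key))
    right_repeats = bool(repeated_keys(right_rows, key))
    if left_repeats and right_repeats:
        return "many-to-many"
    if left_repeats or right_repeats:
        return "one-to-many"
    return "one-to-one"

# Hypothetical example: two parties each hold several rows per subject.
purchases = [
    {"subject_id": 1, "amount": 20},
    {"subject_id": 1, "amount": 35},
    {"subject_id": 2, "amount": 10},
]
visits = [
    {"subject_id": 1, "page": "home"},
    {"subject_id": 2, "page": "cart"},
    {"subject_id": 2, "page": "checkout"},
]
print(relationship_type(purchases, visits, "subject_id"))
```

A check like this only identifies the situation; how DEREC transforms such tables is described in the paper itself, not here.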

Data collaboration via a Data Clean Room offers value but raises privacy concerns, which can be addressed through synthetic data and multi-table synthesizers. Common multi-table synthesizers fail when subjects occur repeatedly in both tables. This is an urgent yet unresolved problem, since it is common for both tables to contain repeating subjects. To improve performance in this scenario, we present the DEREC 3-step pre-processing pipeline, which extends multi-table synthesizers to this setting. We also introduce the SIMPRO 3-aspect evaluation metrics, which leverage conditional distributions and large-scale simultaneous hypothesis testing to provide comprehensive feedback on synthetic data fidelity at both the column and table levels. Results show that using DEREC improves fidelity, and that multi-table synthesizers outperform their single-table counterparts in collaboration settings. Together, the DEREC-SIMPRO pipeline offers a robust solution for generalizing data collaboration, promoting a more efficient, data-driven society.
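The abstract does not spell out how SIMPRO runs its simultaneous hypothesis tests, so the following is only a generic sketch of the idea of column-level fidelity testing with multiple-testing control. It swaps in two standard techniques that are assumptions here, not the paper's method: a permutation test on the difference of means for each column, and the Benjamini-Hochberg procedure across columns. The function names, the dict-of-lists table format, and the example columns are hypothetical, and the conditional-distribution aspect is not covered.

```python
import random
from statistics import mean

def permutation_pvalue(real, synth, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference of means."""
    rng = random.Random(seed)
    observed = abs(mean(real) - mean(synth))
    pooled = list(real) + list(synth)
    n = len(real)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(mean(pooled[:n]) - mean(pooled[n:])) >= observed:
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    return (hits + 1) / (n_perm + 1)

def benjamini_hochberg(pvalues, alpha=0.05):
    """Return one boolean per p-value: True where the null is rejected."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= alpha * rank / m:
            cutoff = rank  # largest rank that passes its threshold
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff:
            rejected[idx] = True
    return rejected

def flag_columns(real_table, synth_table, alpha=0.05):
    """Map each column name to True if real and synthetic data differ."""
    names = sorted(real_table)
    pvals = [permutation_pvalue(real_table[c], synth_table[c]) for c in names]
    return dict(zip(names, benjamini_hochberg(pvals, alpha)))

# Hypothetical tables: "age" is reproduced exactly, "income" is badly shifted.
real = {"age": list(range(30)), "income": list(range(30))}
synth = {"age": list(range(30)), "income": list(range(100, 130))}
flags = flag_columns(real, synth)
print(flags)
```

Testing every column at once and then correcting across columns is what keeps the false discovery rate under control when a table has many columns; a table-level summary could then be built from the per-column flags.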
