LGMar 3, 2025
Synthetic Tabular Data Detection In the WildG. Charbel N. Kindji, Elisa Fromont, Lina Maria Rojas-Barahona et al.
Detecting synthetic tabular data is essential to prevent the distribution of false or manipulated datasets that could compromise data-driven decision-making. This study explores whether synthetic tabular data can be reliably identified across different tables. This challenge is unique to tabular data, where structures (such as number of columns, data types, and formats) can vary widely from one table to another. We propose four table-agnostic detectors combined with simple preprocessing schemes that we evaluate on six evaluation protocols, with different levels of ''wildness''. Our results show that cross-table learning on a restricted set of tables is possible even with naive preprocessing schemes. They confirm however that cross-table transfer (i.e. deployment on a table that has not been seen before) is challenging. This suggests that sophisticated encoding schemes are required to handle this problem.
LGAug 27, 2025
Robust Detection of Synthetic Tabular Data under Schema VariabilityG. Charbel N. Kindji, Elisa Fromont, Lina Maria Rojas-Barahona et al.
The rise of powerful generative models has sparked concerns over data authenticity. While detection methods have been extensively developed for images and text, the case of tabular data, despite its ubiquity, has been largely overlooked. Yet, detecting synthetic tabular data is especially challenging due to its heterogeneous structure and unseen formats at test time. We address the underexplored task of detecting synthetic tabular data in the wild, where tables have variable and previously unseen schemas. We introduce a novel datum-wise transformer architecture that significantly outperforms the only previously published baseline, improving both AUC and accuracy by 7 points. By incorporating a table-adaptation component, our model gains an additional 7 accuracy points, demonstrating enhanced robustness. This work provides the first strong evidence that detecting synthetic tabular data in real-world conditions is not only feasible, but can be done with high reliability.
LGApr 10, 2025
Datum-wise Transformer for Synthetic Tabular Data Detection in the WildG. Charbel N. Kindji, Elisa Fromont, Lina Maria Rojas-Barahona et al.
The growing power of generative models raises major concerns about the authenticity of published content. To address this problem, several synthetic content detection methods have been proposed for uniformly structured media such as image or text. However, little work has been done on the detection of synthetic tabular data, despite its importance in industry and government. This form of data is complex to handle due to the diversity of its structures: the number and types of the columns may vary wildly from one table to another. We tackle the tough problem of detecting synthetic tabular data ''in the wild'', i.e. when the model is deployed on table structures it has never seen before. We introduce a novel datum-wise transformer architecture and show that it outperforms existing models. Furthermore, we investigate the application of domain adaptation techniques to enhance the effectiveness of our model, thereby providing a more robust data-forgery detection solution.
LGDec 17, 2024
Cross-table Synthetic Tabular Data DetectionG. Charbel N. Kindji, Lina Maria Rojas-Barahona, Elisa Fromont et al.
Detecting synthetic tabular data is essential to prevent the distribution of false or manipulated datasets that could compromise data-driven decision-making. This study explores whether synthetic tabular data can be reliably identified ''in the wild''-meaning across different generators, domains, and table formats. This challenge is unique to tabular data, where structures (such as number of columns, data types, and formats) can vary widely from one table to another. We propose three cross-table baseline detectors and four distinct evaluation protocols, each corresponding to a different level of ''wildness''. Our very preliminary results confirm that cross-table adaptation is a challenging task.
LGJun 18, 2024
Tabular Data Generation Models: An In-Depth Survey and Performance Benchmarks with Extensive TuningG. Charbel N. Kindji, Lina Maria Rojas-Barahona, Elisa Fromont et al.
The ability to train generative models that produce realistic, safe and useful tabular data is essential for data privacy, imputation, oversampling, explainability or simulation. However, generating tabular data is not straightforward due to its heterogeneity, non-smooth distributions, complex dependencies and imbalanced categorical features. Although diverse methods have been proposed in the literature, there is a need for a unified evaluation, under the same conditions, on a variety of datasets. This study addresses this need by fully considering the optimization of: hyperparameters, feature encodings, and architectures. We investigate the impact of dataset-specific tuning on five recent model families for tabular data generation through an extensive benchmark on 16 datasets. These datasets vary in terms of size (an average of 80,000 rows), data types, and domains. We also propose a reduced search space for each model that allows for quick optimization, achieving nearly equivalent performance at a significantly lower cost. Our benchmark demonstrates that, for most models, large-scale dataset-specific tuning substantially improves performance compared to the original configurations. Furthermore, we confirm that diffusion-based models generally outperform other models on tabular data. However, this advantage is not significant when the entire tuning and training process is restricted to the same GPU budget.