Why LLMs Are Bad at Synthetic Table Generation (and what to do about it)
The work addresses a critical problem for business and science applications, where synthetic table generation is essential. Its contribution is incremental, however, because it builds on existing LLM fine-tuning methods.
The paper tackles the inadequacy of LLMs for synthetic table generation. It attributes the problem to their autoregressive nature combined with random order permutation during fine-tuning, which hampers the modeling of functional dependencies and prevents capturing conditional mixtures of distributions. It reports that making LLMs permutation-aware can mitigate these issues, though the summary gives no concrete numbers.
Synthetic data generation is integral to ML pipelines, e.g., to augment training data, replace sensitive information, and even to power advanced platforms like DeepSeek. While LLMs fine-tuned for synthetic data generation are gaining traction, the generation of synthetic tables, a critical data type in business and science, remains under-explored compared to text and image synthesis. This paper shows that LLMs, whether used as-is or after traditional fine-tuning, are inadequate for generating synthetic tables. Their autoregressive nature, combined with random order permutation during fine-tuning, hampers the modeling of functional dependencies and prevents capturing conditional mixtures of distributions essential for real-world constraints. We demonstrate that making LLMs permutation-aware can mitigate these issues.
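The ordering problem described above can be illustrated with a toy sketch. This is not the paper's method: the row, the single functional dependency (city determines state), and all helper names are hypothetical, and topological sorting is just one plausible way to read "permutation-aware". The sketch contrasts a random column order with an order in which every determinant column is serialized before the columns it determines, so an autoregressive model always sees the determinant first.

```python
import random
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical toy row and functional dependencies (FDs), for illustration only.
ROW = {"city": "Boston", "state": "MA", "age": 34}
# Maps each dependent column to the set of columns that determine it.
FDS = {"state": {"city"}}


def serialize(row, order):
    """Render a row as the text sequence an autoregressive model would read."""
    return ", ".join(f"{col} is {row[col]}" for col in order)


def random_order(columns, rng):
    """Random column permutation (the fine-tuning setup the text criticizes)."""
    order = list(columns)
    rng.shuffle(order)
    return order


def dependency_order(columns, fds):
    """Order columns so each determinant precedes the columns it determines."""
    graph = {col: fds.get(col, set()) for col in columns}
    return list(TopologicalSorter(graph).static_order())


def violates(order, fds):
    """True if a dependent column is emitted before one of its determinants."""
    pos = {col: i for i, col in enumerate(order)}
    return any(pos[dep] < pos[det] for dep, dets in fds.items() for det in dets)


rng = random.Random(0)
print(serialize(ROW, random_order(ROW, rng)))
print(serialize(ROW, dependency_order(ROW, FDS)))
```

With these three columns, half of all possible permutations place `state` before `city`, so a model trained on random orders must often generate the dependent value before seeing the value that determines it. The dependency-respecting order never does.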