CL · May 26

Generating Logically Consistent Synthetic Supply Chain Data with LLM-Driven Knowledge Graph Reasoning

Yunbo Long, Ge Zheng, Liming Xu, Alexandra Brintrup

arXiv:2605.26823 · 80.4

Predicted impact: top 62% in CL · last 90 days
Originality: Incremental advance

AI Analysis

For supply chain analytics practitioners, TabKG provides a way to generate synthetic data that is both statistically realistic and operationally plausible, enabling reliable simulation and decision-making without compromising data privacy.

TabKG generates synthetic supply chain data that preserves operational logic (temporal orderings, mathematical dependencies, hierarchical taxonomies, conditional rules) by constructing a validated Column Relationship Knowledge Graph from column metadata using a multi-LLM ensemble, then using it to guide a latent diffusion model. The method enforces logical consistency by construction, addressing a key limitation of existing tabular generative models.
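
The ensemble-then-validate step can be pictured with a small sketch. It assumes majority voting over edges proposed by several models, followed by a check against real rows. The edge encoding (`"lte"`, `"product"`), the column names, the toy proposals, and the `min_support` threshold are all hypothetical and chosen for illustration; they are not taken from the paper.

```python
from collections import Counter
import math

def majority_vote(proposals):
    """Keep edges proposed by more than half of the models."""
    counts = Counter(edge for edges in proposals for edge in set(edges))
    return sorted(e for e, c in counts.items() if c > len(proposals) / 2)

def holds(edge, row):
    """Check one hypothetical relationship on one row."""
    kind = edge[0]
    if kind == "lte":        # ("lte", a, b) means row[a] <= row[b]
        return row[edge[1]] <= row[edge[2]]
    if kind == "product":    # ("product", t, a, b) means row[t] == row[a] * row[b]
        return math.isclose(row[edge[1]], row[edge[2]] * row[edge[3]])
    raise ValueError(f"unknown relationship kind: {kind}")

def validate(edges, rows, min_support=1.0):
    """Drop edges that the real data does not support often enough."""
    kept = []
    for edge in edges:
        support = sum(holds(edge, r) for r in rows) / len(rows)
        if support >= min_support:
            kept.append(edge)
    return kept

# Toy "real" table (assumed columns).
rows = [
    {"order_day": 1, "ship_day": 3, "qty": 2, "unit_price": 5.0, "total": 10.0},
    {"order_day": 4, "ship_day": 4, "qty": 1, "unit_price": 7.5, "total": 7.5},
    {"order_day": 6, "ship_day": 9, "qty": 3, "unit_price": 2.0, "total": 6.0},
]

# Stand-ins for the outputs of three LLMs; no model is actually called here.
proposals = [
    {("lte", "order_day", "ship_day"), ("product", "total", "qty", "unit_price")},
    {("lte", "order_day", "ship_day"), ("product", "total", "qty", "unit_price"),
     ("lte", "ship_day", "order_day")},
    {("product", "total", "qty", "unit_price"), ("lte", "ship_day", "order_day"),
     ("lte", "qty", "total")},
]

voted = majority_vote(proposals)
validated = validate(voted, rows)
```

In this toy run the edge `("lte", "qty", "total")` fails the vote because only one model proposes it, while `("lte", "ship_day", "order_day")` passes the vote but is removed by validation because the rows contradict it.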

Synthetic data offers a promising solution to two persistent barriers in supply chain analytics: data scarcity and data privacy. However, to support operational simulation and decision-making, synthetic data must do more than reproduce the statistical distributions of real records; it must also preserve the operational logic that governs supply chain processes, including the temporal orderings, mathematical dependencies, hierarchical taxonomies, and conditional rules that make a record operationally plausible. We regard this logic as the "physics" of supply chain data. Existing tabular generative models are primarily optimized for distributional fidelity and downstream predictive utility, and therefore often generate records that appear statistically realistic but violate fundamental operational constraints. This paper introduces TabKG, a knowledge-graph-guided framework for generating logically consistent synthetic supply chain tabular data. TabKG constructs a Column Relationship Knowledge Graph (CR-KG) to represent operational dependencies between columns. It uses a multi-LLM ensemble with majority voting to propose candidate relationships from column metadata, validates these relationships against real data to remove hallucinated or unsupported edges, and then uses the validated CR-KG to guide generation. Specifically, TabKG compresses the original table into its independent columns, generates these columns using a latent diffusion model, and deterministically reconstructs the dependent columns according to the validated relationships, enforcing logical consistency by construction with respect to the discovered operational rules.
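
The compress, generate, and reconstruct pipeline described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `RULES` mapping, the column names, and the taxonomy lookup are hypothetical, and `sample_independent` is a simple resampling placeholder standing in for the latent diffusion model.

```python
import random

# Hypothetical taxonomy: each subcategory belongs to exactly one category.
SUBCATEGORY_TO_CATEGORY = {
    "laptop": "electronics",
    "phone": "electronics",
    "desk": "furniture",
}

# Hypothetical validated relationships: dependent column -> (parents, function).
# This sketch assumes every parent is an independent column.
RULES = {
    "total": (("qty", "unit_price"), lambda qty, unit_price: qty * unit_price),
    "category": (("subcategory",), lambda sub: SUBCATEGORY_TO_CATEGORY[sub]),
}

def compress(rows, rules):
    """Drop dependent columns, keeping only the independent ones."""
    return [{k: v for k, v in row.items() if k not in rules} for row in rows]

def sample_independent(rows, n, seed=0):
    """Placeholder generator: resamples each column on its own.

    This is NOT a latent diffusion model; it only stands in for one.
    """
    rng = random.Random(seed)
    columns = list(rows[0])
    return [{c: rng.choice(rows)[c] for c in columns} for _ in range(n)]

def reconstruct(rows, rules):
    """Recompute every dependent column from its parents."""
    out = []
    for row in rows:
        full = dict(row)
        for target, (parents, fn) in rules.items():
            full[target] = fn(*(full[p] for p in parents))
        out.append(full)
    return out

real = [
    {"subcategory": "laptop", "category": "electronics",
     "qty": 2, "unit_price": 900.0, "total": 1800.0},
    {"subcategory": "desk", "category": "furniture",
     "qty": 1, "unit_price": 250.0, "total": 250.0},
    {"subcategory": "phone", "category": "electronics",
     "qty": 3, "unit_price": 600.0, "total": 1800.0},
]

independent = compress(real, RULES)
synthetic = reconstruct(sample_independent(independent, n=5), RULES)
```

Because `total` and `category` are never sampled, no generated row can violate the corresponding rules, whatever the generator produces for the independent columns. That is the sense in which consistency holds by construction, and only with respect to the rules present in the graph.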
