LG | Jul 28, 2025

PROVCREATOR: Synthesizing Complex Heterogenous Graphs with Node and Edge Attributes

arXiv:2507.20967v1 | 2 citations | h-index: 18
Originality: Incremental advance
AI Analysis

This paper addresses the shortage of synthetic graph generation methods for real-world applications that require semantic fidelity, such as cybersecurity and knowledge graphs. The contribution appears incremental, since it adapts existing sequence generation techniques to graphs.

The researchers tackled the challenge of generating synthetic graphs with complex heterogeneous structures and high-dimensional attributes. They introduce ProvCreator, which uses transformer-based language models to synthesize realistic graphs. They validated the approach on cybersecurity provenance graphs and knowledge graphs, reporting that it captures intricate dependencies and generates privacy-aware synthetic datasets.

The rise of graph-structured data has driven interest in graph learning and synthetic data generation. While synthetic data generation has been successful in text and image domains, synthetic graph generation remains challenging, especially for real-world graphs with complex, heterogeneous schemas. Existing research has focused mostly on homogeneous structures with simple attributes, limiting its usefulness and relevance for application domains requiring semantic fidelity. In this research, we introduce ProvCreator, a synthetic graph framework designed for complex heterogeneous graphs with high-dimensional node and edge attributes. ProvCreator formulates graph synthesis as a sequence generation task, enabling the use of transformer-based large language models. It features a versatile graph-to-sequence encoder-decoder that (1) losslessly encodes graph structure and attributes, (2) efficiently compresses large graphs for contextual modeling, and (3) supports end-to-end, learnable graph generation. To validate our research, we evaluate ProvCreator on two challenging domains: system provenance graphs in cybersecurity and knowledge graphs from the IntelliGraph Benchmark Dataset. In both cases, ProvCreator captures intricate dependencies between structure and semantics, enabling the generation of realistic and privacy-aware synthetic datasets.
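
The idea of treating graph synthesis as sequence generation can be illustrated with a toy round trip. The sketch below is not ProvCreator's encoder: the `<node>` and `<edge>` markers, the fixed-width token layout, and the names `encode_graph` and `decode_graph` are all assumptions made for illustration. It shows only the lossless-encoding property (a heterogeneous graph with node and edge attributes flattened to tokens and recovered exactly), and it does not cover compression or learned generation.

```python
import json

# Hypothetical token layout (an assumption, not the paper's format):
#   node record: "<node>", node_id, node_type, attrs_json
#   edge record: "<edge>", src_id, dst_id, edge_type, attrs_json

def encode_graph(nodes, edges):
    """Flatten a typed, attributed graph into a list of string tokens."""
    tokens = []
    for node_id, (node_type, attrs) in nodes.items():
        tokens += ["<node>", node_id, node_type, json.dumps(attrs, sort_keys=True)]
    for src, dst, edge_type, attrs in edges:
        tokens += ["<edge>", src, dst, edge_type, json.dumps(attrs, sort_keys=True)]
    return tokens

def decode_graph(tokens):
    """Rebuild the node dict and edge list from the token sequence."""
    nodes, edges = {}, []
    i = 0
    while i < len(tokens):
        if tokens[i] == "<node>":
            node_id, node_type, attrs = tokens[i + 1:i + 4]
            nodes[node_id] = (node_type, json.loads(attrs))
            i += 4
        elif tokens[i] == "<edge>":
            src, dst, edge_type, attrs = tokens[i + 1:i + 5]
            edges.append((src, dst, edge_type, json.loads(attrs)))
            i += 5
        else:
            raise ValueError(f"unexpected token at position {i}: {tokens[i]!r}")
    return nodes, edges

# Toy provenance-style graph: a process node writing to a file node.
nodes = {
    "p1": ("process", {"cmdline": "powershell.exe -nop"}),
    "f1": ("file", {"path": "C:\\Temp\\out.txt"}),
}
edges = [("p1", "f1", "write", {"bytes": 128})]

tokens = encode_graph(nodes, edges)
restored_nodes, restored_edges = decode_graph(tokens)
```

Because every record has a fixed width and attributes are serialized as JSON, decoding recovers the same nodes and edges that went in, provided the attribute values are JSON-compatible. In a sequence-modeling setup, a token stream of this kind is what a language model would be trained to produce.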

