A Framework for Large Scale Synthetic Graph Dataset Generation
The work addresses a bottleneck for graph learning researchers and developers by providing synthetic data for prototyping and testing, though the contribution is incremental because it builds on existing generation methods.
The paper tackles the shortage of large-scale public graph datasets by proposing a scalable synthetic graph generation tool that learns from proprietary data and can produce graphs with trillions of edges and billions of nodes for research and benchmarking.
Recently, there has been increasing interest in developing and deploying deep graph learning algorithms for many tasks, such as fraud detection and recommender systems. However, the number of publicly available graph-structured datasets is limited, and most of them are tiny compared to production-sized applications or are limited in their application domain. This work tackles this shortcoming by proposing a scalable synthetic graph generation tool that scales datasets to production-size graphs with trillions of edges and billions of nodes. The tool learns, from proprietary datasets, a series of parametric models that can be released to researchers to study various graph methods on the synthetic data, supporting faster prototype development and novel applications. We demonstrate the generalizability of the framework across a series of datasets, showing that it mimics structural and feature distributions and can scale them across varying sizes, which makes it useful for benchmarking and model development. Code can be found at https://github.com/NVIDIA/DeepLearningExamples/tree/master/Tools/DGLPyTorch/SyntheticGraphGeneration.
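As a rough illustration of how a small parametric structural model can be reused to generate graphs of different sizes, the sketch below samples edges with R-MAT (recursive matrix) generation. The abstract does not state which models the tool fits, so the choice of R-MAT, the function name `rmat_edges`, and the quadrant probabilities are assumptions made for illustration only and are not the framework's API.

```python
import numpy as np


def rmat_edges(scale, num_edges, probs=(0.57, 0.19, 0.19, 0.05), seed=0):
    """Sample `num_edges` directed edges over 2**scale nodes with R-MAT.

    `probs` holds the four quadrant probabilities (a, b, c, d). In a real
    pipeline they would be fitted to a source graph; here they are
    hypothetical placeholder values.
    """
    rng = np.random.default_rng(seed)
    src = np.zeros(num_edges, dtype=np.int64)
    dst = np.zeros(num_edges, dtype=np.int64)
    for _ in range(scale):
        # At each recursion level, every edge picks one quadrant of the
        # adjacency matrix, which fixes one more bit of its row and column.
        quadrant = rng.choice(4, size=num_edges, p=probs)
        src = (src << 1) | (quadrant >> 1)  # quadrants 2 and 3 set the row bit
        dst = (dst << 1) | (quadrant & 1)   # quadrants 1 and 3 set the column bit
    return src, dst


# The same four parameters generate graphs of different sizes.
small_src, small_dst = rmat_edges(scale=10, num_edges=8_000)
large_src, large_dst = rmat_edges(scale=14, num_edges=128_000)
print(small_src.shape, int(small_src.max()) < 2**10)
print(large_src.shape, int(large_src.max()) < 2**14)
```

Only the four probabilities need to be shared to reproduce the structure at any size, which is the sense in which a fitted model, rather than the underlying data, could be released. Node and edge features would need separate models that this sketch does not cover.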