Towards Data-centric Graph Machine Learning: Review and Outlook
This is an incremental review that organizes existing work for researchers and practitioners in graph machine learning, focusing on data management rather than novel algorithmic breakthroughs.
The paper tackles the challenge of applying data-centric AI principles to graph data by introducing a systematic framework called Data-centric Graph Machine Learning (DC-GML), which reviews and organizes methods across the graph data lifecycle to address issues like data quality and availability.
Data-centric AI, with its primary focus on the collection, management, and utilization of data to drive AI models and applications, has attracted increasing attention in recent years. In this article, we conduct an in-depth and comprehensive review of current efforts in data-centric AI pertaining to graph data, the fundamental data structure for representing and capturing intricate dependencies among massive and diverse real-life entities, and offer a forward-looking outlook. We introduce a systematic framework, Data-centric Graph Machine Learning (DC-GML), that encompasses all stages of the graph data lifecycle, including graph data collection, exploration, improvement, exploitation, and maintenance. A thorough taxonomy of each stage is presented to answer three critical graph-centric questions: (1) how to enhance graph data availability and quality; (2) how to learn from graph data with limited availability and low quality; (3) how to build graph MLOps systems from the graph data-centric view. Lastly, we pinpoint the future prospects of the DC-GML domain, providing insights to navigate its advancements and applications.