LG | CL | Dec 12, 2024

Multi-Scale Heterogeneous Text-Attributed Graph Datasets From Diverse Domains

arXiv:2412.08937v1 | 5 citations | h-index: 4 | WWW
Originality: Synthesis-oriented
AI Analysis

The paper provides a resource for researchers working on graph learning with text attributes, though its contribution is incremental: it focuses on dataset creation rather than new methods.

The authors tackled the lack of comprehensive datasets for Heterogeneous Text-Attributed Graphs (HTAGs) by introducing a collection of multi-scale, diverse benchmark datasets spanning domains like movies and patents, and conducted benchmark experiments with graph neural networks to enable realistic evaluation.

Heterogeneous Text-Attributed Graphs (HTAGs), where different types of entities are not only associated with texts but also connected by diverse relationships, have gained widespread popularity and application across various domains. However, current research on text-attributed graph learning predominantly focuses on homogeneous graphs, which feature a single node type and a single edge type, leaving a gap in understanding how methods perform on HTAGs. One crucial reason is the lack of comprehensive HTAG datasets that offer original textual content and span multiple domains of varying sizes. To this end, we introduce a collection of challenging and diverse benchmark datasets for realistic and reproducible evaluation of machine learning models on HTAGs. Our HTAG datasets are multi-scale, span years in duration, and cover a wide range of domains, including movie, community question answering, academic, literature, and patent networks. We further conduct benchmark experiments on these datasets with various graph neural networks. All source data, dataset construction code, processed HTAGs, data loaders, benchmark code, and evaluation setup are publicly available on GitHub and Hugging Face.
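
To make the distinction between homogeneous graphs and HTAGs concrete, here is a minimal Python sketch of a heterogeneous text-attributed graph. It is illustrative only and is not the data format or loader released with the paper: the `HTAG` class, the `build_movie_example` helper, and the node types, relation names, and texts are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class HTAG:
    """Toy heterogeneous text-attributed graph (hypothetical, not the paper's format)."""

    # node_text[node_type][node_id] -> raw text attached to that node
    node_text: dict = field(default_factory=lambda: defaultdict(dict))
    # edges[(src_type, relation, dst_type)] -> list of (src_id, dst_id) pairs
    edges: dict = field(default_factory=lambda: defaultdict(list))

    def add_node(self, node_type, node_id, text):
        self.node_text[node_type][node_id] = text

    def add_edge(self, src_type, src_id, relation, dst_type, dst_id):
        # Reject edges whose endpoints were never registered as nodes.
        if src_id not in self.node_text.get(src_type, {}):
            raise KeyError(f"unknown {src_type} node: {src_id}")
        if dst_id not in self.node_text.get(dst_type, {}):
            raise KeyError(f"unknown {dst_type} node: {dst_id}")
        self.edges[(src_type, relation, dst_type)].append((src_id, dst_id))

    def is_homogeneous(self):
        # A homogeneous graph has at most one node type and one edge type.
        return len(self.node_text) <= 1 and len(self.edges) <= 1


def build_movie_example():
    """Build a tiny movie-domain graph with made-up nodes and texts."""
    g = HTAG()
    g.add_node("movie", "m1", "A heist thriller set inside shared dreams.")
    g.add_node("person", "p1", "Director known for non-linear storytelling.")
    g.add_node("person", "p2", "Actor with a long career in drama.")
    g.add_edge("person", "p1", "directed", "movie", "m1")
    g.add_edge("person", "p2", "acted_in", "movie", "m1")
    return g


if __name__ == "__main__":
    g = build_movie_example()
    print(sorted(g.node_text))   # node types
    print(sorted(g.edges))       # edge types as (src_type, relation, dst_type)
    print(g.is_homogeneous())
```

Keying edges by a `(src_type, relation, dst_type)` triple keeps each relation separate, which is the property that distinguishes an HTAG from a graph with a single node and edge type.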

Code Implementations: 1 repo