AI LG SI · Oct 22, 2024

SaVe-TAG: Semantic-aware Vicinal Risk Minimization for Long-Tailed Text-Attributed Graphs

Leyao Wang, Yu Wang, Bo Ni, Yuying Zhao, Hanyu Wang, Yao Ma, Tyler Derr

arXiv:2410.16882v34.21 citations · h-index: 10

Originality: Highly original

AI Analysis

This work addresses class imbalance in text-attributed graphs, with applications such as social network analysis and recommendation systems. It is an incremental improvement that integrates semantic and structural signals.

The paper tackles long-tailed class distributions in text-attributed graphs, which hinder the generalization of Graph Neural Networks across classes. It proposes SaVe-TAG, a semantic-aware vicinal risk minimization framework that combines text-level interpolation by Large Language Models with a confidence-based edge assignment mechanism. In experiments on benchmark datasets, SaVe-TAG consistently outperforms numeric interpolation and prior baselines.

Real-world graph data often follows long-tailed distributions, making it difficult for Graph Neural Networks (GNNs) to generalize well across both head and tail classes. Recent advances in Vicinal Risk Minimization (VRM) have shown promise in mitigating class imbalance with numeric interpolation; however, existing approaches largely rely on embedding-space arithmetic, which fails to capture the rich semantics inherent in text-attributed graphs. In this work, we propose our method, SaVe-TAG (Semantic-aware Vicinal Risk Minimization for Long-Tailed Text-Attributed Graphs), a novel VRM framework that leverages Large Language Models (LLMs) to perform text-level interpolation, generating on-manifold, boundary-enriching synthetic samples for minority classes. To mitigate the risk of noisy generation, we introduce a confidence-based edge assignment mechanism that uses graph topology as a natural filter to ensure structural consistency. We provide theoretical justification for our method and conduct extensive experiments on benchmark datasets, showing that our approach consistently outperforms both numeric interpolation and prior long-tailed node classification baselines. Our results highlight the importance of integrating semantic and structural signals for balanced and effective learning on text-attributed graphs.
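The abstract contrasts numeric (embedding-space) interpolation with text-level interpolation, followed by confidence-based edge assignment. The sketch below illustrates those three ideas in a minimal form; it is not the authors' implementation. The function names, the prompt wording, the top-k rule, and the threshold are all assumptions made for illustration, and the confidence scores are presumed to come from some separately trained edge predictor that the abstract does not specify.

```python
import numpy as np


def numeric_interpolate(x_i, x_j, lam):
    """Embedding-space interpolation (mixup/SMOTE style): lam*x_i + (1-lam)*x_j."""
    return lam * np.asarray(x_i, dtype=float) + (1.0 - lam) * np.asarray(x_j, dtype=float)


def build_interpolation_prompt(text_a, text_b, label):
    """Hypothetical prompt asking an LLM to blend two texts of the same minority class."""
    return (
        f"Both texts below belong to the class '{label}'.\n"
        f"Text A: {text_a}\n"
        f"Text B: {text_b}\n"
        "Write one new text of the same class that combines ideas from A and B."
    )


def assign_edges(confidences, k, threshold):
    """Return indices of up to k candidate neighbours whose confidence >= threshold.

    Assumption: confidences[i] is a score in [0, 1] for linking the synthetic
    node to existing node i, produced by an edge predictor not shown here.
    """
    confidences = np.asarray(confidences, dtype=float)
    top_k = np.argsort(-confidences)[:k]  # highest confidence first
    return [int(i) for i in top_k if confidences[i] >= threshold]


if __name__ == "__main__":
    # Numeric baseline: a point between two minority-class embeddings.
    print(numeric_interpolate([0.0, 0.0], [2.0, 4.0], lam=0.25))
    # Text-level variant: the prompt would be sent to an LLM of your choice.
    print(build_interpolation_prompt("Paper on GNN pruning.", "Paper on GNN distillation.", "ML"))
    # Edge assignment: keep only confident links for the synthetic node.
    print(assign_edges([0.9, 0.2, 0.7, 0.95], k=2, threshold=0.5))
```

In this sketch the threshold plays the filtering role the abstract attributes to graph topology: a synthetic text that no existing node links to with high confidence receives few or no edges, which limits its influence during message passing.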
