CL AI IR LG | Jun 17, 2024

UniGLM: Training One Unified Language Model for Text-Attributed Graph Embedding

Yi Fang, Dongzhe Fan, Sirui Ding, Ninghao Liu, Qiaoyu Tan

arXiv:2406.12052v29.119 citations | Has Code

Originality: Incremental advance

AI Analysis

UniGLM addresses a limitation of existing TAG embedding methods, which are trained per graph and cannot generalize across TAG scenarios. This benefits textual and relational knowledge systems and recommendation systems. The advance is incremental because it builds on established contrastive learning and language model fine-tuning.

The paper tackles the problem of representation learning on text-attributed graphs (TAGs) by introducing UniGLM, a unified language model that generalizes across different graph domains and scales, achieving state-of-the-art performance on 9 benchmark datasets in downstream tasks and transfer learning scenarios.

Representation learning on text-attributed graphs (TAGs), where nodes are represented by textual descriptions, is crucial for textual and relational knowledge systems and recommendation systems. Currently, state-of-the-art embedding methods for TAGs primarily focus on fine-tuning language models (e.g., BERT) using structure-aware training signals. While effective, these methods are tailored to individual TAGs and cannot generalize across various graph scenarios. Given the shared textual space, jointly fine-tuning on multiple TAGs, which aligns text and graph structure from different aspects, would be more beneficial. Motivated by this, we introduce a novel Unified Graph Language Model (UniGLM) framework, the first graph embedding model that generalizes well to both in-domain and cross-domain TAGs. Specifically, UniGLM is trained over multiple TAGs with different domains and scales using self-supervised contrastive learning. UniGLM includes an adaptive positive sample selection technique for identifying structurally similar nodes and a lazy contrastive module devised to accelerate training by minimizing repetitive encoding calculations. Extensive empirical results across 9 benchmark TAGs demonstrate UniGLM's efficacy against leading embedding baselines in terms of generalization (various downstream tasks and backbones) and transfer learning (in-domain and out-of-domain scenarios). The code is available at https://github.com/NYUSHCS/UniGLM.
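To make the contrastive idea in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes an InfoNCE-style loss with multiple positives, and it approximates the "lazy" idea with a cache of previously computed node embeddings, so that only the anchor is freshly encoded while positives and negatives are read from the cache. The function names (`l2_normalize`, `lazy_contrastive_step`), the temperature value, and the use of plain graph neighbors as positives are all illustrative assumptions; the paper's adaptive positive selection is not specified in this summary.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors to unit length along the last axis."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)


def lazy_contrastive_step(anchor_id, anchor_emb, bank, positive_ids,
                          temperature=0.1):
    """One contrastive step for a single anchor node (hypothetical helper).

    anchor_emb   : freshly encoded embedding of the anchor, shape (d,)
    bank         : cached embeddings of all nodes, shape (n, d); read for
                   positives and negatives instead of re-encoding them
    positive_ids : indices of nodes treated as positives (assumed here to be
                   graph neighbors; must not contain anchor_id)
    Returns the loss and refreshes the anchor's row in the cache in place.
    """
    anchor = l2_normalize(anchor_emb)
    logits = l2_normalize(bank) @ anchor / temperature  # shape (n,)
    logits[anchor_id] = -np.inf          # ignore the anchor's own cached copy
    logits = logits - logits.max()       # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum())
    loss = -log_prob[positive_ids].mean()
    bank[anchor_id] = anchor_emb         # update the cache for later steps
    return loss


# Toy usage: 4 nodes with 4-dimensional cached embeddings.
bank = np.eye(4)
fresh = np.array([0.1, 1.0, 0.0, 0.0])   # anchor 0, encoded close to node 1
loss = lazy_contrastive_step(0, fresh, bank, positive_ids=[1])
print(float(loss))
```

In this sketch the loss is small when the anchor's fresh embedding points toward its cached positives and large when it points toward other nodes. A real training loop would backpropagate through the encoder that produced `anchor_emb`, which NumPy does not do; the cache lookup is the only part meant to illustrate how repeated encoding can be avoided.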
