Yurui Lai

h-index: 2

6 papers

14 citations

Novelty: 70%

AI Score: 43

Ranked #57,311 of 194,257 authors (top 30%); #13,031 in LG (top 32%)

6 Papers

7.9 | LG | Dec 16, 2024 | Code

Leveraging Large Language Models for Effective Label-free Node Classification in Text-Attributed Graphs

Taiyan Zhang, Renchi Yang, Yurui Lai et al.

Graph neural networks (GNNs) have become the preferred models for node classification in graph data due to their robust capabilities in integrating graph structures and attributes. However, these models heavily depend on a substantial amount of high-quality labeled data for training, which is often costly to obtain. With the rise of large language models (LLMs), a promising approach is to utilize their exceptional zero-shot capabilities and extensive knowledge for node labeling. Despite encouraging results, this approach either requires numerous queries to LLMs or suffers from reduced performance due to noisy labels generated by LLMs. To address these challenges, we introduce Locle, an active self-training framework that does Label-free node Classification with LLMs cost-Effectively. Locle iteratively identifies small sets of "critical" samples using GNNs and extracts informative pseudo-labels for them with both LLMs and GNNs, serving as additional supervision signals to enhance model training. Specifically, Locle comprises three key components: (i) an effective active node selection strategy for initial annotations; (ii) a careful sample selection scheme to identify "critical" nodes based on label disharmonicity and entropy; and (iii) a label refinement module that combines LLMs and GNNs with a rewired topology. Extensive experiments on five benchmark text-attributed graph datasets demonstrate that Locle significantly outperforms state-of-the-art methods under the same query budget to LLMs in terms of label-free node classification. Notably, on the DBLP dataset with 14.3k nodes, Locle achieves an 8.08% improvement in accuracy over the state-of-the-art at a cost of less than one cent. Our code is available at https://github.com/HKBU-LAGAS/Locle.
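The entropy criterion mentioned in the abstract above can be illustrated with a minimal sketch. It ranks nodes by the Shannon entropy of their predicted class distribution and keeps the most uncertain ones. This is not Locle's actual selection scheme, which also uses label disharmonicity; the function names and the toy predictions are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_uncertain(pred, k):
    """Return the ids of the k nodes whose predictions have the highest entropy."""
    ranked = sorted(pred, key=lambda node: entropy(pred[node]), reverse=True)
    return ranked[:k]


# Hypothetical softmax outputs of a GNN for four nodes and three classes.
pred = {
    "a": [0.98, 0.01, 0.01],  # confident prediction, low entropy
    "b": [0.34, 0.33, 0.33],  # near-uniform prediction, highest entropy
    "c": [0.70, 0.20, 0.10],
    "d": [0.50, 0.50, 0.00],
}
selected = select_uncertain(pred, 2)
```

Under this toy input the two selected nodes are the near-uniform one and the three-way split, which are the cases where an extra label would plausibly help most.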

AI | May 27

Semi-Supervised Text-Attributed Graph Distillation

Yurui Lai, Samir Moustafa, Renchi Yang et al.

Text-Attributed Graphs (TAGs) have emerged as an expressive data model for integrating graph topology with rich textual semantics. Existing representation learning methods over TAGs suffer from severe scalability bottlenecks, particularly when combined with Large Language Models (LLMs). While data distillation offers a promising data-centric solution, existing methods fail to capture the complex interplay between graph and text modalities, struggle with the label scarcity inherent in semi-supervised settings, and lack the ability to produce the human-readable textual attributes required for downstream LLM-based tasks. To address these challenges, we propose a unified semi-supervised framework guided by the Wasserstein Distance (WSD). Grounded in our empirical findings on real TAGs, the framework introduces a graph-text collaborative encoding module that utilizes dual-pathway encoders (graph-aware and graph-free) within a collaborative self-training scheme to harvest reliable pseudo-labels and fuse complementary graph-text features. Furthermore, we develop a theoretically grounded WSD-based graph sketching algorithm and a cost-effective LLM text synthesis module, which leverages cluster-based keyword extraction to generate coherent, human-readable summaries for condensed nodes. Extensive experiments on benchmark datasets demonstrate that the framework achieves a state-of-the-art performance-compression trade-off in terms of both GNN- and LLM-based downstream tasks, enabling effective and efficient TAG learning or analytics.
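As background on the Wasserstein distance named in the abstract above, here is a minimal sketch of its simplest case: two one-dimensional samples of equal size with uniform weights, where the 1-Wasserstein distance reduces to the mean absolute difference between sorted values. It is not the paper's WSD-based graph sketching algorithm, and the function name and sample data are hypothetical.

```python
def wasserstein_1d(xs, ys):
    """W1 distance between two equal-size 1-D samples with uniform weights."""
    if len(xs) != len(ys):
        raise ValueError("this sketch only handles samples of equal size")
    # In 1-D the optimal transport plan matches the i-th smallest value of
    # one sample to the i-th smallest value of the other.
    return sum(abs(x - y) for x, y in zip(sorted(xs), sorted(ys))) / len(xs)


# Hypothetical feature values of an original and a shifted sample.
original = [0.0, 1.0, 3.0]
shifted = [8.0, 5.0, 6.0]
distance = wasserstein_1d(original, shifted)
```

Because every value is moved by the same offset of 5, the distance here equals that offset; identical samples give a distance of zero.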

11.4 | LG | May 27, 2025

Simple yet Effective Graph Distillation via Clustering

Yurui Lai, Taiyan Zhang, Renchi Yang

Despite the many successes of graph representation learning in various domains, training graph neural networks (GNNs) remains challenging due to the tremendous computational overhead needed for sizable graphs in practice. Recently, graph data distillation (GDD), which seeks to distill large graphs into compact and informative ones, has emerged as a promising technique to enable efficient GNN training. However, most existing GDD works rely on heuristics that align model gradients or representation distributions on condensed and original graphs, leading to compromised result quality, expensive training for distilling large graphs, or both. Motivated by this, this paper presents an efficient and effective GDD approach, ClustGDD. Under the hood, ClustGDD synthesizes the condensed graph and node attributes through fast and theoretically grounded clustering that minimizes the within-cluster sum of squares and maximizes the homophily on the original graph. The fundamental idea is inspired by our empirical and theoretical findings unveiling the connection between clustering and empirical condensation quality using Fréchet Inception Distance, a well-known quality metric for synthetic images. Furthermore, to mitigate the adverse effects caused by the homophily-based clustering, ClustGDD refines the nodal attributes of the condensed graph with a small augmentation learned via class-aware graph sampling and consistency loss. Our extensive experiments show that GNNs trained over condensed graphs output by ClustGDD consistently achieve superior or comparable performance to state-of-the-art GDD methods in terms of node classification on five benchmark datasets, while being orders of magnitude faster.
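The within-cluster sum of squares (WCSS) objective mentioned in the abstract above can be made concrete with a minimal sketch. Given a fixed cluster assignment, it averages node features per cluster to form condensed attributes and reports the WCSS. This is not ClustGDD itself, which also optimizes homophily and refines the attributes; the function name, the toy features, and the assumption that every cluster is non-empty are illustrative.

```python
def condense(features, assignment, num_clusters):
    """Average the features of each cluster and report the WCSS.

    Assumes every cluster id in range(num_clusters) has at least one node.
    """
    dim = len(features[0])
    sums = [[0.0] * dim for _ in range(num_clusters)]
    counts = [0] * num_clusters
    for x, c in zip(features, assignment):
        counts[c] += 1
        for j, value in enumerate(x):
            sums[c][j] += value
    # One centroid per cluster: these act as the condensed node attributes.
    centroids = [[s / n for s in row] for row, n in zip(sums, counts)]
    # WCSS: squared distance of every node to the centroid of its cluster.
    wcss = sum(
        (value - centroids[c][j]) ** 2
        for x, c in zip(features, assignment)
        for j, value in enumerate(x)
    )
    return centroids, wcss


# Four hypothetical nodes with 2-D features, grouped into two clusters.
features = [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [10.0, 12.0]]
assignment = [0, 0, 1, 1]
centroids, wcss = condense(features, assignment, 2)
```

A lower WCSS means the condensed attributes sit closer to the nodes they replace, which is the intuition behind using clustering as a condensation step.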

10.4 | LG | Jun 8, 2024 | Code

Efficient Topology-aware Data Augmentation for High-Degree Graph Neural Networks

Yurui Lai, Xiaoyang Lin, Renchi Yang et al.

In recent years, graph neural networks (GNNs) have emerged as a potent tool for learning on graph-structured data and have achieved notable success in varied fields. The majority of GNNs follow the message-passing paradigm, where representations of each node are learned by recursively aggregating features of its neighbors. However, this mechanism brings severe over-smoothing and efficiency issues over high-degree graphs (HDGs), wherein most nodes have dozens (or even hundreds) of neighbors, such as social networks, transaction graphs, and power grids. Additionally, such graphs usually encompass rich and complex structure semantics, which are hard to capture merely by feature aggregations in GNNs. Motivated by the above limitations, we propose TADA, an efficient and effective front-mounted data augmentation framework for GNNs on HDGs. Under the hood, TADA includes two key modules: (i) feature expansion with structure embeddings, and (ii) topology- and attribute-aware graph sparsification. The former obtains augmented node features and enhanced model capacity by encoding the graph structure into high-quality structure embeddings with our highly efficient sketching method. Further, by exploiting task-relevant features extracted from graph structures and attributes, the second module enables the accurate identification and reduction of numerous redundant/noisy edges from the input graph, thereby alleviating over-smoothing and facilitating faster feature aggregations over HDGs. Empirically, TADA considerably improves the predictive performance of mainstream GNN models on 8 real homophilic/heterophilic HDGs in terms of node classification, while achieving efficient training and inference processes.
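The message-passing paradigm described in the abstract above can be sketched as a single aggregation step in which each node averages the features of its neighbors. This is a generic illustration with mean aggregation and no learned weights, not TADA or any specific GNN layer; the function name and the toy graph are hypothetical.

```python
def mean_aggregate(features, neighbors):
    """One message-passing step: each node averages its neighbors' features."""
    out = {}
    for node, nbrs in neighbors.items():
        if not nbrs:
            # Isolated node: keep its own features unchanged.
            out[node] = list(features[node])
            continue
        dim = len(features[node])
        out[node] = [sum(features[n][j] for n in nbrs) / len(nbrs) for j in range(dim)]
    return out


# Hypothetical 3-node star graph: node 0 is connected to nodes 1 and 2.
features = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
neighbors = {0: [1, 2], 1: [0], 2: [0]}
hidden = mean_aggregate(features, neighbors)
```

The work per step grows with the number of edges, which is why removing redundant edges from a high-degree graph speeds up aggregation.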

2.0 | LG | May 28, 2023

Self-attention Dual Embedding for Graphs with Heterophily

Yurui Lai, Taiyan Zhang, Rui Fan

Graph Neural Networks (GNNs) have been highly successful for the node classification task. GNNs typically assume graphs are homophilic, i.e., neighboring nodes are likely to belong to the same class. However, a number of real-world graphs are heterophilic, and this leads to much lower classification accuracy using standard GNNs. In this work, we design a novel GNN which is effective for both heterophilic and homophilic graphs. Our work is based on three main observations. First, we show that node features and graph topology are informative to different degrees in different graphs, and therefore they should be encoded independently and prioritized in an adaptive manner. Second, we show that allowing negative attention weights when propagating graph topology information improves accuracy. Finally, we show that asymmetric attention weights between nodes are helpful. We design a GNN which makes use of these observations through a novel self-attention mechanism. We evaluate our algorithm on real-world graphs containing thousands to millions of nodes and show that we achieve state-of-the-art results compared to existing GNNs. We also analyze the effectiveness of the main components of our design on different graphs.
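The homophily notion in the abstract above can be quantified with the edge homophily ratio, the fraction of edges whose endpoints share a class label. The abstract does not say which homophily measure the authors use, so this choice is an assumption; the function name and the toy graph are hypothetical.

```python
def edge_homophily(edges, labels):
    """Fraction of edges whose two endpoints carry the same class label."""
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)


# Hypothetical 4-cycle with two classes: two same-class edges, two mixed edges.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
labels = {0: "x", 1: "x", 2: "y", 3: "y"}
ratio = edge_homophily(edges, labels)
```

A ratio near 1 indicates a homophilic graph, while a ratio near 0 indicates a heterophilic one.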

2.6 | CV | Nov 21, 2021

Denoised Internal Models: a Brain-Inspired Autoencoder against Adversarial Attacks

Kaiyuan Liu, Xingyu Li, Yurui Lai et al.

Despite its great success, deep learning suffers severely from a lack of robustness; that is, deep neural networks are very vulnerable to adversarial attacks, even the simplest ones. Inspired by recent advances in brain science, we propose the Denoised Internal Models (DIM), a novel generative autoencoder-based model to tackle this challenge. Simulating the pipeline in the human brain for visual signal processing, DIM adopts a two-stage approach. In the first stage, DIM uses a denoiser to reduce the noise and the dimensions of inputs, reflecting the information pre-processing in the thalamus. Inspired by the sparse coding of memory-related traces in the primary visual cortex, the second stage produces a set of internal models, one for each category. We evaluate DIM over 42 adversarial attacks, showing that DIM effectively defends against all the attacks and outperforms the SOTA on overall robustness.