Wei Jin

LG
h-index42
78papers
5,278citations
Novelty49%
AI Score63

78 Papers

LGJun 15, 2022Code
Condensing Graphs via One-Step Gradient Matching

Wei Jin, Xianfeng Tang, Haoming Jiang et al. · amazon-science, berkeley

As training deep learning models on large dataset takes a lot of time and resources, it is desired to construct a small synthetic dataset with which we can train deep learning models sufficiently. There are recent works that have explored solutions on condensing image datasets through complex bi-level optimization. For instance, dataset condensation (DC) matches network gradients w.r.t. large-real data and small-synthetic data, where the network weights are optimized for multiple steps at each outer iteration. However, existing approaches have their inherent limitations: (1) they are not directly applicable to graphs where the data is discrete; and (2) the condensation process is computationally expensive due to the involved nested optimization. To bridge the gap, we investigate efficient dataset condensation tailored for graph datasets where we model the discrete graph structure as a probabilistic model. We further propose a one-step gradient matching scheme, which performs gradient matching for only one single step without training the network weights. Our theoretical analysis shows this strategy can generate synthetic graphs that lead to lower classification loss on real graphs. Extensive experiments on various graph datasets demonstrate the effectiveness and efficiency of the proposed method. In particular, we are able to reduce the dataset size by 90% while approximating up to 98% of the original performance and our method is significantly faster than multi-step gradient matching (e.g. 15x in CIFAR10 for synthesizing 500 graphs). Code is available at \url{https://github.com/amazon-research/DosCond}.

LGJul 7, 2023Code
Exploring the Potential of Large Language Models (LLMs) in Learning on Graphs

Zhikai Chen, Haitao Mao, Hang Li et al.

Learning on Graphs has attracted immense attention due to its wide real-world applications. The most popular pipeline for learning on graphs with textual node attributes primarily relies on Graph Neural Networks (GNNs), and utilizes shallow text embedding as initial node representations, which has limitations in general knowledge and profound semantic understanding. In recent years, Large Language Models (LLMs) have been proven to possess extensive common knowledge and powerful semantic comprehension abilities that have revolutionized existing workflows to handle text data. In this paper, we aim to explore the potential of LLMs in graph machine learning, especially the node classification task, and investigate two possible pipelines: LLMs-as-Enhancers and LLMs-as-Predictors. The former leverages LLMs to enhance nodes' text attributes with their massive knowledge and then generate predictions through GNNs. The latter attempts to directly employ LLMs as standalone predictors. We conduct comprehensive and systematical studies on these two pipelines under various settings. From comprehensive empirical results, we make original observations and find new insights that open new possibilities and suggest promising directions to leverage LLMs for learning on graphs. Our codes and datasets are available at https://github.com/CurryTang/Graph-LLM.

LGOct 7, 2022Code
Empowering Graph Representation Learning with Test-Time Graph Transformation

Wei Jin, Tong Zhao, Jiayuan Ding et al.

As powerful tools for representation learning on graphs, graph neural networks (GNNs) have facilitated various applications from drug discovery to recommender systems. Nevertheless, the effectiveness of GNNs is immensely challenged by issues related to data quality, such as distribution shift, abnormal features and adversarial attacks. Recent efforts have been made on tackling these issues from a modeling perspective which requires additional cost of changing model architectures or re-training model parameters. In this work, we provide a data-centric view to tackle these issues and propose a graph transformation framework named GTrans which adapts and refines graph data at test time to achieve better performance. We provide theoretical analysis on the design of the framework and discuss why adapting graph data works better than adapting the model. Extensive experiments have demonstrated the effectiveness of GTrans on three distinct scenarios for eight benchmark datasets where suboptimal data is presented. Remarkably, GTrans performs the best in most cases with improvements up to 2.8%, 8.2% and 3.8% over the best baselines on three experimental settings. Code is released at https://github.com/ChandlerBang/GTrans.

LGMar 3, 2022Code
Graph Neural Networks for Multimodal Single-Cell Data Integration

Hongzhi Wen, Jiayuan Ding, Wei Jin et al.

Recent advances in multimodal single-cell technologies have enabled simultaneous acquisitions of multiple omics data from the same cell, providing deeper insights into cellular states and dynamics. However, it is challenging to learn the joint representations from the multimodal data, model the relationship between modalities, and, more importantly, incorporate the vast amount of single-modality datasets into the downstream analyses. To address these challenges and correspondingly facilitate multimodal single-cell data analyses, three key tasks have been introduced: $\textit{modality prediction}$, $\textit{modality matching}$ and $\textit{joint embedding}$. In this work, we present a general Graph Neural Network framework $\textit{scMoGNN}$ to tackle these three tasks and show that $\textit{scMoGNN}$ demonstrates superior results in all three tasks compared with the state-of-the-art and conventional approaches. Our method is an official winner in the overall ranking of $\textit{Modality prediction}$ from NeurIPS 2021 Competition, and all implementations of our methods have been integrated into DANCE package~\url{https://github.com/OmicsML/dance}.

LGAug 30, 2022Code
Learning Representations for Hyper-Relational Knowledge Graphs

Harry Shomer, Wei Jin, Juanhui Li et al.

Knowledge graphs (KGs) have gained prominence for their ability to learn representations for uni-relational facts. Recently, research has focused on modeling hyper-relational facts, which move beyond the restriction of uni-relational facts and allow us to represent more complex and real-world information. However, existing approaches for learning representations on hyper-relational KGs majorly focus on enhancing the communication from qualifiers to base triples while overlooking the flow of information from base triple to qualifiers. This can lead to suboptimal qualifier representations, especially when a large amount of qualifiers are presented. It motivates us to design a framework that utilizes multiple aggregators to learn representations for hyper-relational facts: one from the perspective of the base triple and the other one from the perspective of the qualifiers. Experiments demonstrate the effectiveness of our framework for hyper-relational knowledge graph completion across multiple datasets. Furthermore, we conduct an ablation study that validates the importance of the various components in our framework. The code to reproduce our results can be found at \url{https://github.com/HarryShomer/QUAD}.

LGJun 1Code
Adaptive Auto-Harness: Sustained Self-Improvement for Agentic System Deployment on Open-Ended Task Streams

Zewen Liu, Zhan Shi, Yisi Sang et al.

Auto-harness systems such as A-Evolve, GEPA, and Meta-Harness improve LLM agents by optimizing prompts, skills, tools, memories, and supporting infrastructure from execution feedback, but they are typically evaluated on fixed offline benchmarks. Real deployments instead present open-ended task streams: histories grow without a fixed endpoint, heterogeneous tasks require different harnesses, and problem distributions shift over time. These challenges make a single repeatedly and densely updated harness brittle, causing performance degradation as accuracy peaks early and then declines. This motivates sustained harness construction with task-wise adaptation. We introduce Adaptive Auto-Harness, a framework and system for such streams. The framework decomposes the gap to an oracle harness into evolution loss and adaptation loss. The system addresses these losses with a stateful multi-agent evolver, a harness tree with solve-time routing, and human-steering hooks for cases where history lacks the needed signal. Across prediction-market, security-competition, and event-forecasting streams, Adaptive Auto-Harness outperforms five existing auto-harness baselines and ablations attribute gains to better construction, routing, or targeted human steering. Code is available in https://github.com/A-EVO-Lab/AdaptiveHarness .

CLNov 1, 2023Code
Knowledge-Infused Prompting: Assessing and Advancing Clinical Text Data Generation with Large Language Models

Ran Xu, Hejie Cui, Yue Yu et al. · gatech

Clinical natural language processing requires methods that can address domain-specific challenges, such as complex medical terminology and clinical contexts. Recently, large language models (LLMs) have shown promise in this domain. Yet, their direct deployment can lead to privacy issues and are constrained by resources. To address this challenge, we delve into synthetic clinical text generation using LLMs for clinical NLP tasks. We propose an innovative, resource-efficient approach, ClinGen, which infuses knowledge into the process. Our model involves clinical knowledge extraction and context-informed LLM prompting. Both clinical topics and writing styles are drawn from external domain-specific knowledge graphs and LLMs to guide data generation. Our extensive empirical study across 7 clinical NLP tasks and 16 datasets reveals that ClinGen consistently enhances performance across various tasks, effectively aligning the distribution of real datasets and significantly enriching the diversity of generated training instances. Our code is available at \url{https://github.com/ritaranx/ClinGen}.

AIJun 3
EpiEvolve: Self-Evolving Agents for Streaming Pandemic Forecasting under Regime Shifts

Yiming Lu, Sihang Zeng, Zhengxu Tang et al.

Epidemic LLM forecasters are usually trained and evaluated as static supervised models, whereas operational pandemic forecasting is a streaming process in which labels arrive after predictions and disease regimes shift over time. We study this mismatch in weekly COVID-19 hospitalization trend forecasting across five variant regimes. We introduce EpiEvolve, a self-evolving agent that wraps an LLM forecaster trained on the warm-start period and keeps its weights fixed during streaming. EpiEvolve adapts by storing forecast outcomes in a hierarchical episodic memory, reflecting on delayed labels, retrieving cases relevant to the current regime, and distilling recurring errors into strategic rules. The resulting context lets the forecaster reuse its own past predictions and outcomes in later weeks while following a chronological protocol that prevents future leakage. On the streaming dataset, EpiEvolve reaches $0.629$ average accuracy, compared with $0.561$ for the static backbone and $0.325$ for the external CDC ensemble, and reduces recovery lag after regime shifts from $5$ to $2$ weeks. Ablations show that reflection, strategic memory, and regime-aware retrieval each contribute to the gains.

LGJun 15, 2022
Feature Overcorrelation in Deep Graph Neural Networks: A New Perspective

Wei Jin, Xiaorui Liu, Yao Ma et al.

Recent years have witnessed remarkable success achieved by graph neural networks (GNNs) in many real-world applications such as recommendation and drug discovery. Despite the success, oversmoothing has been identified as one of the key issues which limit the performance of deep GNNs. It indicates that the learned node representations are highly indistinguishable due to the stacked aggregators. In this paper, we propose a new perspective to look at the performance degradation of deep GNNs, i.e., feature overcorrelation. Through empirical and theoretical study on this matter, we demonstrate the existence of feature overcorrelation in deeper GNNs and reveal potential reasons leading to this issue. To reduce the feature correlation, we propose a general framework DeCorr which can encourage GNNs to encode less redundant information. Extensive experiments have demonstrated that DeCorr can help enable deeper GNNs and is complementary to existing techniques tackling the oversmoothing issue.

IRJul 19, 2023
Amazon-M2: A Multilingual Multi-locale Shopping Session Dataset for Recommendation and Text Generation

Wei Jin, Haitao Mao, Zheng Li et al.

Modeling customer shopping intentions is a crucial task for e-commerce, as it directly impacts user experience and engagement. Thus, accurately understanding customer preferences is essential for providing personalized recommendations. Session-based recommendation, which utilizes customer session data to predict their next interaction, has become increasingly popular. However, existing session datasets have limitations in terms of item attributes, user diversity, and dataset scale. As a result, they cannot comprehensively capture the spectrum of user behaviors and preferences. To bridge this gap, we present the Amazon Multilingual Multi-locale Shopping Session Dataset, namely Amazon-M2. It is the first multilingual dataset consisting of millions of user sessions from six different locales, where the major languages of products are English, German, Japanese, French, Italian, and Spanish. Remarkably, the dataset can help us enhance personalization and understanding of user preferences, which can benefit various existing tasks as well as enable new tasks. To test the potential of the dataset, we introduce three tasks in this work: (1) next-product recommendation, (2) next-product recommendation with domain shifts, and (3) next-product title generation. With the above tasks, we benchmark a range of algorithms on our proposed dataset, drawing new insights for further research and practice. In addition, based on the proposed dataset and tasks, we hosted a competition in the KDD CUP 2023 and have attracted thousands of users and submissions. The winning solutions and the associated workshop can be accessed at our website https://kddcup23.github.io/.

LGJun 2, 2023
Demystifying Structural Disparity in Graph Neural Networks: Can One Size Fit All?

Haitao Mao, Zhikai Chen, Wei Jin et al.

Recent studies on Graph Neural Networks(GNNs) provide both empirical and theoretical evidence supporting their effectiveness in capturing structural patterns on both homophilic and certain heterophilic graphs. Notably, most real-world homophilic and heterophilic graphs are comprised of a mixture of nodes in both homophilic and heterophilic structural patterns, exhibiting a structural disparity. However, the analysis of GNN performance with respect to nodes exhibiting different structural patterns, e.g., homophilic nodes in heterophilic graphs, remains rather limited. In the present study, we provide evidence that Graph Neural Networks(GNNs) on node classification typically perform admirably on homophilic nodes within homophilic graphs and heterophilic nodes within heterophilic graphs while struggling on the opposite node set, exhibiting a performance disparity. We theoretically and empirically identify effects of GNNs on testing nodes exhibiting distinct structural patterns. We then propose a rigorous, non-i.i.d PAC-Bayesian generalization bound for GNNs, revealing reasons for the performance disparity, namely the aggregated feature distance and homophily ratio difference between training and testing nodes. Furthermore, we demonstrate the practical implications of our new findings via (1) elucidating the effectiveness of deeper GNNs; and (2) revealing an over-looked distribution shift factor on graph out-of-distribution problem and proposing a new scenario accordingly.

LGFeb 10, 2023
Toward Degree Bias in Embedding-Based Knowledge Graph Completion

Harry Shomer, Wei Jin, Wentao Wang et al.

A fundamental task for knowledge graphs (KGs) is knowledge graph completion (KGC). It aims to predict unseen edges by learning representations for all the entities and relations in a KG. A common concern when learning representations on traditional graphs is degree bias. It can affect graph algorithms by learning poor representations for lower-degree nodes, often leading to low performance on such nodes. However, there has been limited research on whether there exists degree bias for embedding-based KGC and how such bias affects the performance of KGC. In this paper, we validate the existence of degree bias in embedding-based KGC and identify the key factor to degree bias. We then introduce a novel data augmentation method, KG-Mixup, to generate synthetic triples to mitigate such bias. Extensive experiments have demonstrated that our method can improve various embedding-based KGC methods and outperform other methods tackling the bias problem on multiple benchmark datasets.

QMOct 22, 2022
Deep Learning in Single-Cell Analysis

Dylan Molho, Jiayuan Ding, Zhaoheng Li et al.

Single-cell technologies are revolutionizing the entire field of biology. The large volumes of data generated by single-cell technologies are high-dimensional, sparse, heterogeneous, and have complicated dependency structures, making analyses using conventional machine learning approaches challenging and impractical. In tackling these challenges, deep learning often demonstrates superior performance compared to traditional machine learning methods. In this work, we give a comprehensive survey on deep learning in single-cell analysis. We first introduce background on single-cell technologies and their development, as well as fundamental concepts of deep learning including the most popular deep architectures. We present an overview of the single-cell analytic pipeline pursued in research applications while noting divergences due to data sources or specific applications. We then review seven popular tasks spanning through different stages of the single-cell analysis pipeline, including multimodal integration, imputation, clustering, spatial domain identification, cell-type deconvolution, cell segmentation, and cell-type annotation. Under each task, we describe the most recent developments in classical and deep learning methods and discuss their advantages and disadvantages. Deep learning tools and benchmark datasets are also summarized for each task. Finally, we discuss the future directions and the most recent challenges. This survey will serve as a reference for biologists and computer scientists, encouraging collaborations.

LGOct 17, 2022
Test-Time Training for Graph Neural Networks

Yiqi Wang, Chaozhuo Li, Wei Jin et al.

Graph Neural Networks (GNNs) have made tremendous progress in the graph classification task. However, a performance gap between the training set and the test set has often been noticed. To bridge such gap, in this work we introduce the first test-time training framework for GNNs to enhance the model generalization capacity for the graph classification task. In particular, we design a novel test-time training strategy with self-supervised learning to adjust the GNN model for each test graph sample. Experiments on the benchmark datasets have demonstrated the effectiveness of the proposed framework, especially when there are distribution shifts between training set and test set. We have also conducted exploratory studies and theoretical analysis to gain deeper understandings on the rationality of the design of the proposed graph test time training framework (GT3).

GNFeb 6, 2023
Single Cells Are Spatial Tokens: Transformers for Spatial Transcriptomic Data Imputation

Hongzhi Wen, Wenzhuo Tang, Wei Jin et al.

Spatially resolved transcriptomics brings exciting breakthroughs to single-cell analysis by providing physical locations along with gene expression. However, as a cost of the extremely high spatial resolution, the cellular level spatial transcriptomic data suffer significantly from missing values. While a standard solution is to perform imputation on the missing values, most existing methods either overlook spatial information or only incorporate localized spatial context without the ability to capture long-range spatial information. Using multi-head self-attention mechanisms and positional encoding, transformer models can readily grasp the relationship between tokens and encode location information. In this paper, by treating single cells as spatial tokens, we study how to leverage transformers to facilitate spatial tanscriptomics imputation. In particular, investigate the following two key questions: (1) $\textit{how to encode spatial information of cells in transformers}$, and (2) $\textit{ how to train a transformer for transcriptomic imputation}$. By answering these two questions, we present a transformer-based imputation framework, SpaFormer, for cellular-level spatial transcriptomic data. Extensive experiments demonstrate that SpaFormer outperforms existing state-of-the-art imputation algorithms on three large-scale datasets while maintaining superior computational efficiency.

GNMar 1, 2023
Single-Cell Multimodal Prediction via Transformers

Wenzhuo Tang, Hongzhi Wen, Renming Liu et al.

The recent development of multimodal single-cell technology has made the possibility of acquiring multiple omics data from individual cells, thereby enabling a deeper understanding of cellular states and dynamics. Nevertheless, the proliferation of multimodal single-cell data also introduces tremendous challenges in modeling the complex interactions among different modalities. The recently advanced methods focus on constructing static interaction graphs and applying graph neural networks (GNNs) to learn from multimodal data. However, such static graphs can be suboptimal as they do not take advantage of the downstream task information; meanwhile GNNs also have some inherent limitations when deeply stacking GNN layers. To tackle these issues, in this work, we investigate how to leverage transformers for multimodal single-cell data in an end-to-end manner while exploiting downstream task information. In particular, we propose a scMoFormer framework which can readily incorporate external domain knowledge and model the interactions within each modality and cross modalities. Extensive experiments demonstrate that scMoFormer achieves superior performance on various benchmark datasets. Remarkably, scMoFormer won a Kaggle silver medal with the rank of 24/1221 (Top 2%) without ensemble in a NeurIPS 2022 competition. Our implementation is publicly available at Github.

LGSep 28, 2024Code
Epidemiology-Aware Neural ODE with Continuous Disease Transmission Graph

Guancheng Wan, Zewen Liu, Max S. Y. Lau et al.

Effective epidemic forecasting is critical for public health strategies and efficient medical resource allocation, especially in the face of rapidly spreading infectious diseases. However, existing deep-learning methods often overlook the dynamic nature of epidemics and fail to account for the specific mechanisms of disease transmission. In response to these challenges, we introduce an innovative end-to-end framework called Epidemiology-Aware Neural ODE with Continuous Disease Transmission Graph (EARTH) in this paper. To learn continuous and regional disease transmission patterns, we first propose EANO, which seamlessly integrates the neural ODE approach with the epidemic mechanism, considering the complex spatial spread process during epidemic evolution. Additionally, we introduce GLTG to model global infection trends and leverage these signals to guide local transmission dynamically. To accommodate both the global coherence of epidemic trends and the local nuances of epidemic transmission patterns, we build a cross-attention approach to fuse the most meaningful information for forecasting. Through the smooth synergy of both components, EARTH offers a more robust and flexible approach to understanding and predicting the spread of infectious diseases. Extensive experiments show EARTH superior performance in forecasting real-world epidemics compared to state-of-the-art methods. The code will be available at https://github.com/Emory-Melody/EpiLearn.

LGMay 29
Learning Cardiac Latent Representations in Vectorcardiogram Space

Bosong Huang, Panzhen Zhao, Zengxiang Li et al.

Electrocardiography (ECG) is a cornerstone of cardiac assessment, making the learning of informative ECG representations fundamental to tasks ranging from disease diagnosis to clinical report generation. However, existing methods operate almost exclusively in the observable ECG signal space. In practice, the standard twelve-lead ECG represents multiple projections of the same underlying cardiac electrical activity from different spatial orientations. Therefore, representation learning in the ECG space inevitably introduces substantial redundancy, which may lead to spurious correlations and increased risk of overfitting. To address this and motivated by the Frank vectorcardiogram (VCG) model, we propose learning a unified latent representation of cardiac electrical activity directly in the VCG space. We introduce LVCG, the first general self-supervised representation learning framework designed to operate in this physically grounded latent space. By learning view-invariant latent VCG representations rather than lead-specific artifacts, VCG minimizes redundancy and improves generalization. LVCG generally outperforms ECG-space baselines across tasks, demonstrating enhanced robustness and generalization, especially in domain shift settings.

LGOct 7, 2023
Label-free Node Classification on Graphs with Large Language Models (LLMS)

Zhikai Chen, Haitao Mao, Hongzhi Wen et al.

In recent years, there have been remarkable advancements in node classification achieved by Graph Neural Networks (GNNs). However, they necessitate abundant high-quality labels to ensure promising performance. In contrast, Large Language Models (LLMs) exhibit impressive zero-shot proficiency on text-attributed graphs. Yet, they face challenges in efficiently processing structural data and suffer from high inference costs. In light of these observations, this work introduces a label-free node classification on graphs with LLMs pipeline, LLM-GNN. It amalgamates the strengths of both GNNs and LLMs while mitigating their limitations. Specifically, LLMs are leveraged to annotate a small portion of nodes and then GNNs are trained on LLMs' annotations to make predictions for the remaining large portion of nodes. The implementation of LLM-GNN faces a unique challenge: how can we actively select nodes for LLMs to annotate and consequently enhance the GNN training? How can we leverage LLMs to obtain annotations of high quality, representativeness, and diversity, thereby enhancing GNN performance with less cost? To tackle this challenge, we develop an annotation quality heuristic and leverage the confidence scores derived from LLMs to advanced node selection. Comprehensive experimental results validate the effectiveness of LLM-GNN. In particular, LLM-GNN can achieve an accuracy of 74.9% on a vast-scale dataset \products with a cost less than 1 dollar.

LGMar 25Code
TimeRecipe: A Time-Series Forecasting Recipe via Benchmarking Module Level Effectiveness

Zhiyuan Zhao, Juntong Ni, Shangqing Xu et al.

Time-series forecasting is an essential task with wide real-world applications across domains. While recent advances in deep learning have enabled time-series forecasting models with accurate predictions, there remains considerable debate over which architectures and design components, such as series decomposition or normalization, are most effective under varying conditions. Existing benchmarks primarily evaluate models at a high level, offering limited insight into why certain designs work better. To mitigate this gap, we propose TimeRecipe, a unified benchmarking framework that systematically evaluates time-series forecasting methods at the module level. TimeRecipe conducts over 10,000 experiments to assess the effectiveness of individual components across a diverse range of datasets, forecasting horizons, and task settings. Our results reveal that exhaustive exploration of the design space can yield models that outperform existing state-of-the-art methods and uncover meaningful intuitions linking specific design choices to forecasting scenarios. Furthermore, we release a practical toolkit within TimeRecipe that recommends suitable model architectures based on these empirical insights. The benchmark is available at: https://github.com/AdityaLab/TimeRecipe.

LGJul 16, 2024
Learning on Graphs with Large Language Models(LLMs): A Deep Dive into Model Robustness

Kai Guo, Zewen Liu, Zhikai Chen et al.

Large Language Models (LLMs) have demonstrated remarkable performance across various natural language processing tasks. Recently, several LLMs-based pipelines have been developed to enhance learning on graphs with text attributes, showcasing promising performance. However, graphs are well-known to be susceptible to adversarial attacks and it remains unclear whether LLMs exhibit robustness in learning on graphs. To address this gap, our work aims to explore the potential of LLMs in the context of adversarial attacks on graphs. Specifically, we investigate the robustness against graph structural and textual perturbations in terms of two dimensions: LLMs-as-Enhancers and LLMs-as-Predictors. Through extensive experiments, we find that, compared to shallow models, both LLMs-as-Enhancers and LLMs-as-Predictors offer superior robustness against structural and textual attacks.Based on these findings, we carried out additional analyses to investigate the underlying causes. Furthermore, we have made our benchmark library openly available to facilitate quick and fair evaluations, and to encourage ongoing innovative research in this field.

LGJun 18, 2023
Globally Interpretable Graph Learning via Distribution Matching

Yi Nian, Yurui Chang, Wei Jin et al.

Graph neural networks (GNNs) have emerged as a powerful model to capture critical graph patterns. Instead of treating them as black boxes in an end-to-end fashion, attempts are arising to explain the model behavior. Existing works mainly focus on local interpretation to reveal the discriminative pattern for each individual instance, which however cannot directly reflect the high-level model behavior across instances. To gain global insights, we aim to answer an important question that is not yet well studied: how to provide a global interpretation for the graph learning procedure? We formulate this problem as globally interpretable graph learning, which targets on distilling high-level and human-intelligible patterns that dominate the learning procedure, such that training on this pattern can recover a similar model. As a start, we propose a novel model fidelity metric, tailored for evaluating the fidelity of the resulting model trained on interpretations. Our preliminary analysis shows that interpretative patterns generated by existing global methods fail to recover the model training procedure. Thus, we further propose our solution, Graph Distribution Matching (GDM), which synthesizes interpretive graphs by matching the distribution of the original and interpretive graphs in the GNN's feature space as its training proceeds, thus capturing the most informative patterns the model learns during training. Extensive experiments on graph classification datasets demonstrate multiple advantages of the proposed method, including high model fidelity, predictive accuracy and time efficiency, as well as the ability to reveal class-relevant structure.

CLMar 18Code
EpiQAL: Benchmarking Large Language Models in Epidemiological Question Answering for Enhanced Alignment and Reasoning

Mingyang Wei, Dehai Min, Zewen Liu et al.

Reliable epidemiological reasoning requires synthesizing study evidence to infer disease burden, transmission dynamics, and intervention effects at the population level. Existing medical question answering benchmarks primarily emphasize clinical knowledge or patient-level reasoning, yet few systematically evaluate evidence-grounded epidemiological inference. We present EpiQAL, the first diagnostic benchmark for epidemiological question answering across diverse diseases, comprising three subsets built from open-access literature. The three subsets progressively test factual recall, multi-step inference, and conclusion reconstruction under incomplete information, and are constructed through a quality-controlled pipeline combining taxonomy guidance, multi-model verification, and difficulty screening. Experiments on fourteen models spanning open-source and proprietary systems reveal that current LLMs show limited performance on epidemiological reasoning, with multi-step inference posing the greatest challenge. Model rankings shift across subsets, and scale alone does not predict success. Chain-of-Thought prompting benefits multi-step inference but yields mixed results elsewhere. EpiQAL provides fine-grained diagnostic signals for evidence-grounding, inferential reasoning, and conclusion reconstruction.

LGSep 13, 2024
Sub-graph Based Diffusion Model for Link Prediction

Hang Li, Wei Jin, Geri Skenderi et al.

Denoising Diffusion Probabilistic Models (DDPMs) represent a contemporary class of generative models with exceptional qualities in both synthesis and maximizing the data likelihood. These models work by traversing a forward Markov Chain where data is perturbed, followed by a reverse process where a neural network learns to undo the perturbations and recover the original data. There have been increasing efforts exploring the applications of DDPMs in the graph domain. However, most of them have focused on the generative perspective. In this paper, we aim to build a novel generative model for link prediction. In particular, we treat link prediction between a pair of nodes as a conditional likelihood estimation of its enclosing sub-graph. With a dedicated design to decompose the likelihood estimation process via the Bayesian formula, we are able to separate the estimation of sub-graph structure and its node features. Such designs allow our model to simultaneously enjoy the advantages of inductive learning and the strong generalization capability. Remarkably, comprehensive experiments across various datasets validate that our proposed method presents numerous advantages: (1) transferability across datasets without retraining, (2) promising generalization on limited training data, and (3) robustness against graph adversarial attacks.

SIJan 29, 2024Code
A Comprehensive Survey on Graph Reduction: Sparsification, Coarsening, and Condensation

Mohammad Hashemi, Shengbo Gong, Juntong Ni et al.

Many real-world datasets can be naturally represented as graphs, spanning a wide range of domains. However, the increasing complexity and size of graph datasets present significant challenges for analysis and computation. In response, graph reduction, or graph summarization, has gained prominence for simplifying large graphs while preserving essential properties. In this survey, we aim to provide a comprehensive understanding of graph reduction methods, including graph sparsification, graph coarsening, and graph condensation. Specifically, we establish a unified definition for these methods and introduce a hierarchical taxonomy to categorize the challenges they address. Our survey then systematically reviews the technical details of these methods and emphasizes their practical applications across diverse scenarios. Furthermore, we outline critical research directions to ensure the continued effectiveness of graph reduction techniques, as well as provide a comprehensive paper list at \url{https://github.com/Emory-Melody/awesome-graph-reduction}. We hope this survey will bridge literature gaps and propel the advancement of this promising field.

LGMar 28, 2024Code
A Review of Graph Neural Networks in Epidemic Modeling

Zewen Liu, Guancheng Wan, B. Aditya Prakash et al.

Since the onset of the COVID-19 pandemic, there has been a growing interest in studying epidemiological models. Traditional mechanistic models mathematically describe the transmission mechanisms of infectious diseases. However, they often suffer from limitations of oversimplified or fixed assumptions, which could cause sub-optimal predictive power and inefficiency in capturing complex relation information. Consequently, Graph Neural Networks(GNNs) have emerged as a progressively popular tool in epidemic research. In this paper, we endeavor to furnish a comprehensive review of GNNs in epidemic tasks and highlight potential future directions. To accomplish this objective, we introduce hierarchical taxonomies for both epidemic tasks and methodologies, offering a trajectory of development within this domain. For epidemic tasks, we establish a taxonomy akin to those typically employed within the epidemic domain. For methodology, we categorize existing work into Neural Models and Hybrid Models. Following this, we perform an exhaustive and systematic examination of the methodologies, encompassing both the tasks and their technical details. Furthermore, we discuss the limitations of existing methods from diverse perspectives and systematically propose future research directions. This survey aims to bridge literature gaps and promote the progression of this promising field, with a list of relevant papers at https://github.com/Emory-Melody/awesome-epidemic-modeling-papers. We hope that it will facilitate synergies between the communities of GNNs and epidemiology, and contribute to their collective progress.

LGFeb 7, 2024Code
Navigating Complexity: Toward Lossless Graph Condensation via Expanding Window Matching

Yuchen Zhang, Tianle Zhang, Kai Wang et al. · pku

Graph condensation aims to reduce the size of a large-scale graph dataset by synthesizing a compact counterpart without sacrificing the performance of Graph Neural Networks (GNNs) trained on it, which has shed light on reducing the computational cost for training GNNs. Nevertheless, existing methods often fall short of accurately replicating the original graph for certain datasets, thereby failing to achieve the objective of lossless condensation. To understand this phenomenon, we investigate the potential reasons and reveal that the previous state-of-the-art trajectory matching method provides biased and restricted supervision signals from the original graph when optimizing the condensed one. This significantly limits both the scale and efficacy of the condensed graph. In this paper, we make the first attempt toward \textit{lossless graph condensation} by bridging the previously neglected supervision signals. Specifically, we employ a curriculum learning strategy to train expert trajectories with more diverse supervision signals from the original graph, and then effectively transfer the information into the condensed graph with expanding window matching. Moreover, we design a loss function to further extract knowledge from the expert trajectories. Theoretical analysis justifies the design of our method and extensive experiments verify its superiority across different datasets. Code is released at https://github.com/NUS-HPC-AI-Lab/GEOM.

LGOct 20, 2023
Spectral-Aware Augmentation for Enhanced Graph Representation Learning

Kaiqi Yang, Haoyu Han, Wei Jin et al.

Graph Contrastive Learning (GCL) has demonstrated remarkable effectiveness in learning representations on graphs in recent years. To generate ideal augmentation views, the augmentation generation methods should preserve essential information while discarding less relevant details for downstream tasks. However, current augmentation methods usually involve random topology corruption in the spatial domain, which fails to adequately address information spread across different frequencies in the spectral domain. Our preliminary study highlights this issue, demonstrating that spatial random perturbations impact all frequency bands almost uniformly. Given that task-relevant information typically resides in specific spectral regions that vary across graphs, this one-size-fits-all approach can pose challenges. We argue that indiscriminate spatial random perturbation might unintentionally weaken task-relevant information, reducing its effectiveness. To tackle this challenge, we propose applying perturbations selectively, focusing on information specific to different frequencies across diverse graphs. In this paper, we present GASSER, a model that applies tailored perturbations to specific frequencies of graph structures in the spectral domain, guided by spectral hints. Through extensive experimentation and theoretical analysis, we demonstrate that the augmentation views generated by GASSER are adaptive, controllable, and intuitively aligned with the homophily ratios and spectrum of graph structures.

CLApr 20
STReasoner: Empowering LLMs for Spatio-Temporal Reasoning in Time Series via Spatial-Aware Reinforcement Learning

Juntong Ni, Shiyu Wang, Qi He et al.

Spatio-temporal reasoning in time series involves the explicit synthesis of temporal dynamics, spatial dependencies, and textual context. This capability is vital for high-stakes decision-making in systems such as traffic networks, power grids, and disease propagation. However, the field remains underdeveloped because most existing works prioritize predictive accuracy over reasoning. To address the gap, we introduce ST-Bench, a benchmark consisting of four core tasks, including etiological reasoning, entity identification, correlation reasoning, and in-context forecasting, developed via a network SDE-based multi-agent data synthesis pipeline. We then propose STReasoner, which empowers LLM to integrate time series, graph structure, and text for explicit reasoning. To promote spatially grounded logic, we introduce S-GRPO, a reinforcement learning algorithm that rewards performance gains specifically attributable to spatial information. Experiments show that STReasoner achieves average accuracy gains between 17% and 135% at only 0.004X the cost of proprietary models and generalizes robustly to real-world data.

LGMar 29, 2025Code
Graph ODEs and Beyond: A Comprehensive Survey on Integrating Differential Equations with Graph Neural Networks

Zewen Liu, Xiaoda Wang, Bohan Wang et al.

Graph Neural Networks (GNNs) and differential equations (DEs) are two rapidly advancing areas of research that have shown remarkable synergy in recent years. GNNs have emerged as powerful tools for learning on graph-structured data, while differential equations provide a principled framework for modeling continuous dynamics across time and space. The intersection of these fields has led to innovative approaches that leverage the strengths of both, enabling applications in physics-informed learning, spatiotemporal modeling, and scientific computing. This survey aims to provide a comprehensive overview of the burgeoning research at the intersection of GNNs and DEs. We will categorize existing methods, discuss their underlying principles, and highlight their applications across domains such as molecular modeling, traffic prediction, and epidemic spreading. Furthermore, we identify open challenges and outline future research directions to advance this interdisciplinary field. A comprehensive paper list is provided at https://github.com/Emory-Melody/Awesome-Graph-NDEs. This survey serves as a resource for researchers and practitioners seeking to understand and contribute to the fusion of GNNs and DEs

HCMar 30, 2023
Estimating Continuous Muscle Fatigue For Multi-Muscle Coordinated Exercise: A Pilot Study on Walking

Chunzhi Yi, Xiaolei Sun, Chunyu Zhang et al.

Assessing the progression of muscle fatigue for daily exercises provides vital indicators for precise rehabilitation, personalized training dose, especially under the context of Metaverse. Assessing fatigue of multi-muscle coordination-involved daily exercises requires the neuromuscular features that represent the fatigue-induced characteristics of spatiotemporal adaptions of multiple muscles and the estimator that captures the time-evolving progression of fatigue. In this paper, we propose to depict fatigue by the features of muscle compensation and spinal module activation changes and estimate continuous fatigue by a physiological rationale model. First, we extract muscle synergy fractionation and the variance of spinal module spikings as features inspired by the prior of fatigue-induced neuromuscular adaptations. Second, we treat the features as observations and develop a Bayesian Gaussian process to capture the time-evolving progression. Third, we solve the issue of lacking supervision information by mathematically formulating the time-evolving characteristics of fatigue as the loss function. Finally, we adapt the metrics that follow the physiological principles of fatigue to quantitatively evaluate the performance. Our extensive experiments present a 0.99 similarity between days, a over 0.7 similarity with other views of fatigue and a nearly 1 weak monotonicity, which outperform other methods. This study would aim the objective assessment of muscle fatigue.

CYApr 27, 2022
Identifying Critical LMS Features for Predicting At-risk Students

Ying Guo, Cengiz Gunay, Sairam Tangirala et al.

Learning management systems (LMSs) have become essential in higher education and play an important role in helping educational institutions to promote student success. Traditionally, LMSs have been used by postsecondary institutions in administration, reporting, and delivery of educational content. In this paper, we present an additional use of LMS by using its data logs to perform data-analytics and identify academically at-risk students. The data-driven insights would allow educational institutions and educators to develop and implement pedagogical interventions targeting academically at-risk students. We used anonymized data logs created by Brightspace LMS during fall 2019, spring 2020, and fall 2020 semesters at our college. Supervised machine learning algorithms were used to predict the final course performance of students, and several algorithms were found to perform well with accuracy above 90%. SHAP value method was used to assess the relative importance of features used in the predictive models. Unsupervised learning was also used to group students into different clusters based on the similarities in their interaction/involvement with LMS. In both of supervised and unsupervised learning, we identified two most-important features (Number_Of_Assignment_Submissions and Content_Completed). More importantly, our study lays a foundation and provides a framework for developing a real-time data analytics metric that may be incorporated into a LMS.

LGFeb 5, 2025Code
Pre-training Epidemic Time Series Forecasters with Compartmental Prototypes

Zewen Liu, Juntong Ni, Max S. Y. Lau et al.

Accurate epidemic forecasting is crucial for outbreak preparedness, but existing data-driven models are often brittle. Typically trained on a single pathogen, they struggle with data scarcity during new outbreaks and fail under distribution shifts caused by viral evolution or interventions. However, decades of surveillance data from diverse diseases offer an untapped source of transferable knowledge. To leverage the collective lessons from history, we propose CAPE, the first open-source pre-trained model for epidemic forecasting. Unlike existing time series foundation models that overlook epidemiological challenges, CAPE models epidemic dynamics as mixtures of latent population states, termed compartmental prototypes. It discovers a flexible dictionary of compartment prototypes directly from surveillance data, enabling each outbreak to be expressed as a time-varying mixture that links observed infections to latent population states. To promote robust generalization, CAPE combines self-supervised pre-training objectives with lightweight epidemic-aware regularizers that align the learned prototypes with epidemiological semantics. On a comprehensive benchmark spanning 17 diseases and 50+ regions, CAPE significantly outperforms strong baselines in zero-shot, few-shot, and full-shot forecasting. This work represents a principled step toward pre-trained epidemic models that are both transferable and epidemiologically grounded.

IRFeb 14, 2024Code
Enhancing ID and Text Fusion via Alternative Training in Session-based Recommendation

Juanhui Li, Haoyu Han, Zhikai Chen et al.

Session-based recommendation has gained increasing attention in recent years, with its aim to offer tailored suggestions based on users' historical behaviors within sessions. To advance this field, a variety of methods have been developed, with ID-based approaches typically demonstrating promising performance. However, these methods often face challenges with long-tail items and overlook other rich forms of information, notably valuable textual semantic information. To integrate text information, various methods have been introduced, mostly following a naive fusion framework. Surprisingly, we observe that fusing these two modalities does not consistently outperform the best single modality by following the naive fusion framework. Further investigation reveals an potential imbalance issue in naive fusion, where the ID dominates and text modality is undertrained. This suggests that the unexpected observation may stem from naive fusion's failure to effectively balance the two modalities, often over-relying on the stronger ID modality. This insight suggests that naive fusion might not be as effective in combining ID and text as previously expected. To address this, we propose a novel alternative training strategy AlterRec. It separates the training of ID and text, thereby avoiding the imbalance issue seen in naive fusion. Additionally, AlterRec designs a novel strategy to facilitate the interaction between the two modalities, enabling them to mutually learn from each other and integrate the text more effectively. Comprehensive experiments demonstrate the effectiveness of AlterRec in session-based recommendation. The implementation is available at https://github.com/Juanhui28/AlterRec.

LGJun 24, 2024Code
GC4NC: A Benchmark Framework for Graph Condensation on Node Classification with New Insights

Shengbo Gong, Juntong Ni, Noveen Sachdeva et al.

Graph condensation (GC) is an emerging technique designed to learn a significantly smaller graph that retains the essential information of the original graph. This condensed graph has shown promise in accelerating graph neural networks while preserving performance comparable to those achieved with the original, larger graphs. Additionally, this technique facilitates downstream applications like neural architecture search and deepens our understanding of redundancies in large graphs. Despite the rapid development of GC methods, particularly for node classification, a unified evaluation framework is still lacking to systematically compare different GC methods or clarify key design choices for improving their effectiveness. To bridge these gaps, we introduce \textbf{GC4NC}, a comprehensive framework for evaluating diverse GC methods on node classification across multiple dimensions including performance, efficiency, privacy preservation, denoising ability, NAS effectiveness, and transferability. Our systematic evaluation offers novel insights into how condensed graphs behave and the critical design choices that drive their success. These findings pave the way for future advancements in GC methods, enhancing both performance and expanding their real-world applications. Our code is available at https://github.com/Emory-Melody/GraphSlim/tree/main/benchmark.

LGJun 15, 2024Code
Text-space Graph Foundation Models: Comprehensive Benchmarks and New Insights

Zhikai Chen, Haitao Mao, Jingzhe Liu et al.

Given the ubiquity of graph data and its applications in diverse domains, building a Graph Foundation Model (GFM) that can work well across different graphs and tasks with a unified backbone has recently garnered significant interests. A major obstacle to achieving this goal stems from the fact that graphs from different domains often exhibit diverse node features. Inspired by multi-modal models that align different modalities with natural language, the text has recently been adopted to provide a unified feature space for diverse graphs. Despite the great potential of these text-space GFMs, current research in this field is hampered by two problems. First, the absence of a comprehensive benchmark with unified problem settings hinders a clear understanding of the comparative effectiveness and practical value of different text-space GFMs. Second, there is a lack of sufficient datasets to thoroughly explore the methods' full potential and verify their effectiveness across diverse settings. To address these issues, we conduct a comprehensive benchmark providing novel text-space datasets and comprehensive evaluation under unified problem settings. Empirical results provide new insights and inspire future research directions. Our code and data are publicly available from \url{https://github.com/CurryTang/TSGFM}.

LGJun 10, 2024Code
EpiLearn: A Python Library for Machine Learning in Epidemic Modeling

Zewen Liu, Yunxiao Li, Mingyang Wei et al.

EpiLearn is a Python toolkit developed for modeling, simulating, and analyzing epidemic data. Although there exist several packages that also deal with epidemic modeling, they are often restricted to mechanistic models or traditional statistical tools. As machine learning continues to shape the world, the gap between these packages and the latest models has become larger. To bridge the gap and inspire innovative research in epidemic modeling, EpiLearn not only provides support for evaluating epidemic models based on machine learning, but also incorporates comprehensive tools for analyzing epidemic data, such as simulation, visualization, transformations, etc. For the convenience of both epidemiologists and data scientists, we provide a unified framework for training and evaluation of epidemic models on two tasks: Forecasting and Source Detection. To facilitate the development of new models, EpiLearn follows a modular design, making it flexible and easy to use. In addition, an interactive web application is also developed to visualize the real-world or simulated epidemic data. Our package is available at https://github.com/Emory-Melody/EpiLearn.

LGFeb 17, 2022Code
Graph Data Augmentation for Graph Machine Learning: A Survey

Tong Zhao, Wei Jin, Yozen Liu et al.

Data augmentation has recently seen increased interest in graph machine learning given its demonstrated ability to improve model performance and generalization by added training data. Despite this recent surge, the area is still relatively under-explored, due to the challenges brought by complex, non-Euclidean structure of graph data, which limits the direct analogizing of traditional augmentation operations on other types of image, video or text data. Our work aims to give a necessary and timely overview of existing graph data augmentation methods; notably, we present a comprehensive and systematic survey of graph data augmentation approaches, summarizing the literature in a structured manner. We first introduce three different taxonomies for categorizing graph data augmentation methods from the data, task, and learning perspectives, respectively. Next, we introduce recent advances in graph data augmentation, differentiated by their methodologies and applications. We conclude by outlining currently unsolved challenges and directions for future research. Overall, our work aims to clarify the landscape of existing literature in graph data augmentation and motivates additional work in this area, providing a helpful resource for researchers and practitioners in the broader graph machine learning domain. Additionally, we provide a continuously updated reading list at https://github.com/zhao-tong/graph-data-augmentation-papers.

LGOct 14, 2021Code
Graph Condensation for Graph Neural Networks

Wei Jin, Lingxiao Zhao, Shichang Zhang et al.

Given the prevalence of large-scale graphs in real-world applications, the storage and time for training neural models have raised increasing concerns. To alleviate the concerns, we propose and study the problem of graph condensation for graph neural networks (GNNs). Specifically, we aim to condense the large, original graph into a small, synthetic and highly-informative graph, such that GNNs trained on the small graph and large graph have comparable performance. We approach the condensation problem by imitating the GNN training trajectory on the original graph through the optimization of a gradient matching loss and design a strategy to condense node futures and structural information simultaneously. Extensive experiments have demonstrated the effectiveness of the proposed framework in condensing different graph datasets into informative smaller graphs. In particular, we are able to approximate the original test accuracy by 95.3% on Reddit, 99.8% on Flickr and 99.0% on Citeseer, while reducing their graph size by more than 99.9%, and the condensed graphs can be used to train various GNN architectures.Code is released at https://github.com/ChandlerBang/GCond.

IRAug 12, 2021Code
Graph Trend Filtering Networks for Recommendations

Wenqi Fan, Xiaorui Liu, Wei Jin et al.

Recommender systems aim to provide personalized services to users and are playing an increasingly important role in our daily lives. The key of recommender systems is to predict how likely users will interact with items based on their historical online behaviors, e.g., clicks, add-to-cart, purchases, etc. To exploit these user-item interactions, there are increasing efforts on considering the user-item interactions as a user-item bipartite graph and then performing information propagation in the graph via Graph Neural Networks (GNNs). Given the power of GNNs in graph representation learning, these GNNs-based recommendation methods have remarkably boosted the recommendation performance. Despite their success, most existing GNNs-based recommender systems overlook the existence of interactions caused by unreliable behaviors (e.g., random/bait clicks) and uniformly treat all the interactions, which can lead to sub-optimal and unstable performance. In this paper, we investigate the drawbacks (e.g., non-adaptive propagation and non-robustness) of existing GNN-based recommendation methods. To address these drawbacks, we introduce a principled graph trend collaborative filtering method and propose the Graph Trend Filtering Networks for recommendations (GTN) that can capture the adaptive reliability of the interactions. Comprehensive experiments and ablation studies are presented to verify and understand the effectiveness of the proposed framework. Our implementation based on PyTorch is available at https://github.com/wenqifan03/GTN-SIGIR2022.

LGJul 5, 2021Code
Elastic Graph Neural Networks

Xiaorui Liu, Wei Jin, Yao Ma et al.

While many existing graph neural networks (GNNs) have been proven to perform $\ell_2$-based graph smoothing that enforces smoothness globally, in this work we aim to further enhance the local smoothness adaptivity of GNNs via $\ell_1$-based graph smoothing. As a result, we introduce a family of GNNs (Elastic GNNs) based on $\ell_1$ and $\ell_2$-based graph smoothing. In particular, we propose a novel and general message passing scheme into GNNs. This message passing algorithm is not only friendly to back-propagation training but also achieves the desired smoothing properties with a theoretical convergence guarantee. Experiments on semi-supervised learning tasks demonstrate that the proposed Elastic GNNs obtain better adaptivity on benchmark datasets and are significantly robust to graph adversarial attacks. The implementation of Elastic GNNs is available at \url{https://github.com/lxiaorui/ElasticGNN}.

LGJun 10, 2021Code
Automated Self-Supervised Learning for Graphs

Wei Jin, Xiaorui Liu, Xiangyu Zhao et al.

Graph self-supervised learning has gained increasing attention due to its capacity to learn expressive node representations. Many pretext tasks, or loss functions have been designed from distinct perspectives. However, we observe that different pretext tasks affect downstream tasks differently cross datasets, which suggests that searching pretext tasks is crucial for graph self-supervised learning. Different from existing works focusing on designing single pretext tasks, this work aims to investigate how to automatically leverage multiple pretext tasks effectively. Nevertheless, evaluating representations derived from multiple pretext tasks without direct access to ground truth labels makes this problem challenging. To address this obstacle, we make use of a key principle of many real-world graphs, i.e., homophily, or the principle that "like attracts like," as the guidance to effectively search various self-supervised pretext tasks. We provide theoretical understanding and empirical evidence to justify the flexibility of homophily in this search task. Then we propose the AutoSSL framework which can automatically search over combinations of various self-supervised tasks. By evaluating the framework on 7 real-world datasets, our experimental results show that AutoSSL can significantly boost the performance on downstream tasks including node clustering and node classification compared with training under individual tasks. Code is released at https://github.com/ChandlerBang/AutoSSL.

LGNov 19, 2020Code
Node Similarity Preserving Graph Convolutional Networks

Wei Jin, Tyler Derr, Yiqi Wang et al.

Graph Neural Networks (GNNs) have achieved tremendous success in various real-world applications due to their strong ability in graph representation learning. GNNs explore the graph structure and node features by aggregating and transforming information within node neighborhoods. However, through theoretical and empirical analysis, we reveal that the aggregation process of GNNs tends to destroy node similarity in the original feature space. There are many scenarios where node similarity plays a crucial role. Thus, it has motivated the proposed framework SimP-GCN that can effectively and efficiently preserve node similarity while exploiting graph structure. Specifically, to balance information from graph structure and node features, we propose a feature similarity preserving aggregation which adaptively integrates graph structure and node features. Furthermore, we employ self-supervised learning to explicitly capture the complex feature similarity and dissimilarity relations between nodes. We validate the effectiveness of SimP-GCN on seven benchmark datasets including three assortative and four disassorative graphs. The results demonstrate that SimP-GCN outperforms representative baselines. Further probe shows various advantages of the proposed framework. The implementation of SimP-GCN is available at \url{https://github.com/ChandlerBang/SimP-GCN}.

LGJun 17, 2020Code
Self-supervised Learning on Graphs: Deep Insights and New Direction

Wei Jin, Tyler Derr, Haochen Liu et al.

The success of deep learning notoriously requires larger amounts of costly annotated data. This has led to the development of self-supervised learning (SSL) that aims to alleviate this limitation by creating domain specific pretext tasks on unlabeled data. Simultaneously, there are increasing interests in generalizing deep learning to the graph domain in the form of graph neural networks (GNNs). GNNs can naturally utilize unlabeled nodes through the simple neighborhood aggregation that is unable to thoroughly make use of unlabeled nodes. Thus, we seek to harness SSL for GNNs to fully exploit the unlabeled data. Different from data instances in the image and text domains, nodes in graphs present unique structure information and they are inherently linked indicating not independent and identically distributed (or i.i.d.). Such complexity is a double-edged sword for SSL on graphs. On the one hand, it determines that it is challenging to adopt solutions from the image and text domains to graphs and dedicated efforts are desired. On the other hand, it provides rich information that enables us to build SSL from a variety of perspectives. Thus, in this paper, we first deepen our understandings on when, why, and which strategies of SSL work with GNNs by empirically studying numerous basic SSL pretext tasks on graphs. Inspired by deep insights from the empirical studies, we propose a new direction SelfTask to build advanced pretext tasks that are able to achieve state-of-the-art performance on various real-world datasets. The specific experimental settings to reproduce our results can be found in \url{https://github.com/ChandlerBang/SelfTask-GNN}.

LGMay 20, 2020Code
Graph Structure Learning for Robust Graph Neural Networks

Wei Jin, Yao Ma, Xiaorui Liu et al.

Graph Neural Networks (GNNs) are powerful tools in representation learning for graphs. However, recent studies show that GNNs are vulnerable to carefully-crafted perturbations, called adversarial attacks. Adversarial attacks can easily fool GNNs in making predictions for downstream tasks. The vulnerability to adversarial attacks has raised increasing concerns for applying GNNs in safety-critical applications. Therefore, developing robust algorithms to defend adversarial attacks is of great significance. A natural idea to defend adversarial attacks is to clean the perturbed graph. It is evident that real-world graphs share some intrinsic properties. For example, many real-world graphs are low-rank and sparse, and the features of two adjacent nodes tend to be similar. In fact, we find that adversarial attacks are likely to violate these graph properties. Therefore, in this paper, we explore these properties to defend adversarial attacks on graphs. In particular, we propose a general framework Pro-GNN, which can jointly learn a structural graph and a robust graph neural network model from the perturbed graph guided by these properties. Extensive experiments on real-world graphs demonstrate that the proposed framework achieves significantly better performance compared with the state-of-the-art defense methods, even when the graph is heavily perturbed. We release the implementation of Pro-GNN to our DeepRobust repository for adversarial attacks and defenses (footnote: https://github.com/DSE-MSU/DeepRobust). The specific experimental settings to reproduce our results can be found in https://github.com/ChandlerBang/Pro-GNN.

LGMay 13, 2020Code
DeepRobust: A PyTorch Library for Adversarial Attacks and Defenses

Yaxin Li, Wei Jin, Han Xu et al.

DeepRobust is a PyTorch adversarial learning library which aims to build a comprehensive and easy-to-use platform to foster this research field. It currently contains more than 10 attack algorithms and 8 defense algorithms in image domain and 9 attack algorithms and 4 defense algorithms in graph domain, under a variety of deep learning architectures. In this manual, we introduce the main contents of DeepRobust with detailed instructions. The library is kept updated and can be found at https://github.com/DSE-MSU/DeepRobust.

LGMar 2, 2020Code
Adversarial Attacks and Defenses on Graphs: A Review, A Tool and Empirical Studies

Wei Jin, Yaxin Li, Han Xu et al.

Deep neural networks (DNNs) have achieved significant performance in various tasks. However, recent studies have shown that DNNs can be easily fooled by small perturbation on the input, called adversarial attacks. As the extensions of DNNs to graphs, Graph Neural Networks (GNNs) have been demonstrated to inherit this vulnerability. Adversary can mislead GNNs to give wrong predictions by modifying the graph structure such as manipulating a few edges. This vulnerability has arisen tremendous concerns for adapting GNNs in safety-critical applications and has attracted increasing research attention in recent years. Thus, it is necessary and timely to provide a comprehensive overview of existing graph adversarial attacks and the countermeasures. In this survey, we categorize existing attacks and defenses, and review the corresponding state-of-the-art methods. Furthermore, we have developed a repository with representative algorithms (https://github.com/DSE-MSU/DeepRobust/tree/master/deeprobust/graph). The repository enables us to conduct empirical studies to deepen our understandings on attacks and defenses on graphs.

LGAug 18, 2024
Threshold Filtering Packing for Supervised Fine-Tuning: Training Related Samples within Packs

Jiancheng Dong, Lei Jiang, Wei Jin et al.

Packing for Supervised Fine-Tuning (SFT) in autoregressive models involves concatenating data points of varying lengths until reaching the designed maximum length to facilitate GPU processing. However, randomly concatenating data points can lead to cross-contamination of sequences due to the significant difference in their subject matter. The mainstream approaches in SFT ensure that each token in the attention calculation phase only focuses on tokens within its own short sequence, without providing additional learning signals for the preceding context. To address these challenges, we introduce Threshold Filtering Packing (TFP), a method that selects samples with related context while maintaining sufficient diversity within the same pack. Our experiments show that TFP offers a simple-to-implement and scalable approach that significantly enhances SFT performance, with observed improvements of up to 7\% on GSM8K, 4\% on HumanEval. Furthermore, results from bias benchmark datasets highlight TFP's promising performance in improving fairness while also boosting prediction accuracy by 15\%.

AIMay 4
An Empirical Study of Agent Skills for Healthcare: Practice, Gaps, and Governance

Gelei Xu, Ningzhi Tang, Xueyang Li et al.

Healthcare automation is shaped by local procedures and organizational constraints, so agent capabilities rarely transfer unchanged across settings. Agent skills, self-contained directories that package reusable procedures for AI agents, are emerging as a procedural layer for adapting healthcare agents across diverse healthcare settings. We present the first empirical analysis of healthcare agent skills, drawing on 557 healthcare-related skills filtered from 58,159 public skills on ClawHub and annotated along ten dimensions covering function, deployment context, autonomy, and safety. We find that public healthcare skills emphasize patient-facing workflow automation and monitoring rather than the diagnostic and treatment-oriented tasks foregrounded in healthcare-agent research; coverage of the healthcare lifecycle and specialized clinical inputs remains uneven; and general technical risk does not reliably capture clinical risk. These findings position healthcare skills as a procedural layer not yet addressed by current benchmarks and risk frameworks.

LGFeb 20, 2025
TimeDistill: Efficient Long-Term Time Series Forecasting with MLP via Cross-Architecture Distillation

Juntong Ni, Zewen Liu, Shiyu Wang et al.

Transformer-based and CNN-based methods demonstrate strong performance in long-term time series forecasting. However, their high computational and storage requirements can hinder large-scale deployment. To address this limitation, we propose integrating lightweight MLP with advanced architectures using knowledge distillation (KD). Our preliminary study reveals different models can capture complementary patterns, particularly multi-scale and multi-period patterns in the temporal and frequency domains. Based on this observation, we introduce TimeDistill, a cross-architecture KD framework that transfers these patterns from teacher models (e.g., Transformers, CNNs) to MLP. Additionally, we provide a theoretical analysis, demonstrating that our KD approach can be interpreted as a specialized form of mixup data augmentation. TimeDistill improves MLP performance by up to 18.6%, surpassing teacher models on eight datasets. It also achieves up to 7X faster inference and requires 130X fewer parameters. Furthermore, we conduct extensive evaluations to highlight the versatility and effectiveness of TimeDistill.