Yanbo Wang

h-index5

4papers

34citations

Novelty54%

AI Score42

Ranked #59,888 of 194,257 authors (top 31%)#13,579 in LG (top 34%)

4 Papers

12.5LGApr 28, 2024Code

4DBInfer: A 4D Benchmarking Toolbox for Graph-Centric Predictive Modeling on Relational DBs

Minjie Wang, Quan Gan, David Wipf et al.

Although RDBs store vast amounts of rich, informative data spread across interconnected tables, the progress of predictive machine learning models as applied to such tasks arguably falls well behind advances in other domains such as computer vision or natural language processing. This deficit stems, at least in part, from the lack of established/public RDB benchmarks as needed for training and evaluation purposes. As a result, related model development thus far often defaults to tabular approaches trained on ubiquitous single-table benchmarks, or on the relational side, graph-based alternatives such as GNNs applied to a completely different set of graph datasets devoid of tabular characteristics. To more precisely target RDBs lying at the nexus of these two complementary regimes, we explore a broad class of baseline models predicated on: (i) converting multi-table datasets into graphs using various strategies equipped with efficient subsampling, while preserving tabular characteristics; and (ii) trainable models with well-matched inductive biases that output predictions based on these input subgraphs. Then, to address the dearth of suitable public benchmarks and reduce siloed comparisons, we assemble a diverse collection of (i) large-scale RDB datasets and (ii) coincident predictive tasks. From a delivery standpoint, we operationalize the above four dimensions (4D) of exploration within a unified, scalable open-source toolbox called 4DBInfer. We conclude by presenting evaluations using 4DBInfer, the results of which highlight the importance of considering each such dimension in the design of RDB predictive models, as well as the limitations of more naive approaches such as simply joining adjacent tables. Our source code is released at https://github.com/awslabs/multi-table-benchmark .

7.9LGNov 6, 2024Code

Reconsidering the Performance of GAE in Link Prediction

Weishuo Ma, Yanbo Wang, Xiyuan Wang et al.

Recent advancements in graph neural networks (GNNs) for link prediction have introduced sophisticated training techniques and model architectures. However, reliance on outdated baselines may exaggerate the benefits of these new approaches. To tackle this issue, we systematically explore Graph Autoencoders (GAEs) by applying model-agnostic tricks in recent methods and tuning hyperparameters. We find that a well-tuned GAE can match the performance of recent sophisticated models while offering superior computational efficiency on widely-used link prediction benchmarks. Our approach delivers substantial performance gains on datasets where structural information dominates and feature data is limited. Specifically, our GAE achieves a state-of-the-art Hits@100 score of 78.41\% on the ogbl-ppa dataset. Furthermore, we examine the impact of various tricks to uncover the reasons behind our success and to guide the design of future methods. Our study emphasizes the critical need to update baselines for a more accurate assessment of progress in GNNs for link prediction. Our code is available at https://github.com/GraphPKU/Refined-GAE.

16.4LGJun 12, 2024Code

Efficient Neural Common Neighbor for Temporal Graph Link Prediction

Xiaohui Zhang, Yanbo Wang, Xiyuan Wang et al.

Temporal graphs are widespread in real-world applications such as social networks, as well as trade and transportation networks. Predicting dynamic links within these evolving graphs is a key problem. Many memory-based methods use temporal interaction histories to generate node embeddings, which are then combined to predict links. However, these approaches primarily focus on individual node representations, often overlooking the inherently pairwise nature of link prediction. While some recent methods attempt to capture pairwise features, they tend to be limited by high computational complexity arising from repeated embedding calculations, making them unsuitable for large-scale datasets like the Temporal Graph Benchmark (TGB). To address the critical need for models that combine strong expressive power with high computational efficiency for link prediction on large temporal graphs, we propose Temporal Neural Common Neighbor (TNCN). Our model achieves this balance by adapting the powerful pairwise modeling principles of Neural Common Neighbor (NCN) to an efficient temporal architecture. TNCN improves upon NCN by efficiently preserving and updating temporal neighbor dictionaries for each node and by using multi-hop common neighbors to learn more expressive pairwise representations. TNCN achieves new state-of-the-art performance on Review from five large-scale real-world TGB datasets, 6 out of 7 datasets in the transductive setting and 3 out of 7 in the inductive setting on small- to medium-scale datasets. Additionally, TNCN demonstrates excellent scalability, outperforming prominent GNN baselines by up to 30.3 times in speed on large datasets. Our code is available at https://github.com/GraphPKU/TNCN.

8.3CLSep 20, 2025

ChemOrch: Empowering LLMs with Chemical Intelligence via Synthetic Instructions

Yue Huang, Zhengzhe Jiang, Xiaonan Luo et al.

Empowering large language models (LLMs) with chemical intelligence remains a challenge due to the scarcity of high-quality, domain-specific instruction-response datasets and the misalignment of existing synthetic data generation pipelines with the inherently hierarchical and rule-governed structure of chemical information. To address this, we propose ChemOrch, a framework that synthesizes chemically grounded instruction-response pairs through a two-stage process: task-controlled instruction generation and tool-aware response construction. ChemOrch enables controllable diversity and levels of difficulty for the generated tasks, and ensures response precision through tool planning and distillation, and tool-based self-repair mechanisms. The effectiveness of ChemOrch is evaluated based on: 1) the high quality of generated instruction data, demonstrating superior diversity and strong alignment with chemical constraints; 2) the reliable generation of evaluation tasks that more effectively reveal LLM weaknesses in chemistry; and 3) the significant improvement of LLM chemistry capabilities when the generated instruction data are used for fine-tuning. Our work thus represents a critical step toward scalable and verifiable chemical intelligence in LLMs.