DCMar 28, 2023
Distributed Graph Embedding with Information-Oriented Random WalksPeng Fang, Arijit Khan, Siqiang Luo et al.
Graph embedding maps graph nodes to low-dimensional vectors, and is widely adopted in machine learning tasks. The increasing availability of billion-edge graphs underscores the importance of learning efficient and effective embeddings on large graphs, such as link prediction on Twitter with over one billion edges. Most existing graph embedding methods fall short of reaching high data scalability. In this paper, we present a general-purpose, distributed, information-centric random walk-based graph embedding framework, DistGER, which can scale to embed billion-edge graphs. DistGER incrementally computes information-centric random walks. It further leverages a multi-proximity-aware, streaming, parallel graph partitioning strategy, simultaneously achieving high local partition quality and excellent workload balancing across machines. DistGER also improves the distributed Skip-Gram learning model to generate node embeddings by optimizing the access locality, CPU throughput, and synchronization efficiency. Experiments on real-world graphs demonstrate that compared to state-of-the-art distributed graph embedding frameworks, including KnightKing, DistDGL, and Pytorch-BigGraph, DistGER exhibits 2.33x-129x acceleration, 45% reduction in cross-machines communication, and > 10% effectiveness improvement in downstream tasks.
LGMar 24, 2023
Data Depth and Core-based Trend Detection on Blockchain Transaction NetworksJason Zhu, Arijit Khan, Cuneyt Gurcan Akcora
Blockchains are significantly easing trade finance, with billions of dollars worth of assets being transacted daily. However, analyzing these networks remains challenging due to the sheer volume and complexity of the data. We introduce a method named InnerCore that detects market manipulators within blockchain-based networks and offers a sentiment indicator for these networks. This is achieved through data depth-based core decomposition and centered motif discovery, ensuring scalability. InnerCore is a computationally efficient, unsupervised approach suitable for analyzing large temporal graphs. We demonstrate its effectiveness by analyzing and detecting three recent real-world incidents from our datasets: the catastrophic collapse of LunaTerra, the Proof-of-Stake switch of Ethereum, and the temporary peg loss of USDC - while also verifying our results against external ground truth. Our experiments show that InnerCore can match the qualified analysis accurately without human involvement, automating blockchain analysis in a scalable manner, while being more effective and efficient than baselines and state-of-the-art attributed change detection approach in dynamic graphs.
LGFeb 5
ATEX-CF: Attack-Informed Counterfactual Explanations for Graph Neural NetworksYu Zhang, Sean Bin Yang, Arijit Khan et al.
Counterfactual explanations offer an intuitive way to interpret graph neural networks (GNNs) by identifying minimal changes that alter a model's prediction, thereby answering "what must differ for a different outcome?". In this work, we propose a novel framework, ATEX-CF that unifies adversarial attack techniques with counterfactual explanation generation-a connection made feasible by their shared goal of flipping a node's prediction, yet differing in perturbation strategy: adversarial attacks often rely on edge additions, while counterfactual methods typically use deletions. Unlike traditional approaches that treat explanation and attack separately, our method efficiently integrates both edge additions and deletions, grounded in theory, leveraging adversarial insights to explore impactful counterfactuals. In addition, by jointly optimizing fidelity, sparsity, and plausibility under a constrained perturbation budget, our method produces instance-level explanations that are both informative and realistic. Experiments on synthetic and real-world node classification benchmarks demonstrate that ATEX-CF generates faithful, concise, and plausible explanations, highlighting the effectiveness of integrating adversarial insights into counterfactual reasoning for GNNs.
CLApr 18, 2025Code
CoT-RAG: Integrating Chain of Thought and Retrieval-Augmented Generation to Enhance Reasoning in Large Language ModelsFeiyang Li, Peng Fang, Zhan Shi et al.
Chain-of-thought (CoT) reasoning boosts large language models' (LLMs) performance on complex tasks but faces two key limitations: a lack of reliability when solely relying on LLM-generated reasoning chains and lower reasoning performance from natural language prompts compared with code prompts. To address these issues, we propose CoT-RAG, a novel reasoning framework with three key designs: (i) Knowledge Graph-driven CoT Generation, featuring knowledge graphs to modulate reasoning chain generation of LLMs, thereby enhancing reasoning credibility; (ii) Learnable Knowledge Case-aware RAG, which incorporates retrieval-augmented generation (RAG) into knowledge graphs to retrieve relevant sub-cases and sub-descriptions, providing LLMs with learnable information; (iii) Pseudo Program Prompting Execution, which promotes greater logical rigor by guiding LLMs to execute reasoning tasks as pseudo-programs. Evaluations on nine public datasets spanning three reasoning tasks reveal significant accuracy gains-ranging from 4.0% to 44.3%-over state-of-the-art methods. Furthermore, tests on four domain-specific datasets demonstrate exceptional accuracy and efficient execution, underscoring its practical applicability and scalability. Our code and data are available at https: //github.com/hustlfy123/CoT-RAG.
DBFeb 5
Cost-Efficient RAG for Entity Matching with LLMs: A Blocking-based ExplorationChuangtao Ma, Zeyu Zhang, Arijit Khan et al.
Retrieval-augmented generation (RAG) enhances LLM reasoning in knowledge-intensive tasks, but existing RAG pipelines incur substantial retrieval and generation overhead when applied to large-scale entity matching. To address this limitation, we introduce CE-RAG4EM, a cost-efficient RAG architecture that reduces computation through blocking-based batch retrieval and generation. We also present a unified framework for analyzing and evaluating RAG systems for entity matching, focusing on blocking-aware optimizations and retrieval granularity. Extensive experiments suggest that CE-RAG4EM can achieve comparable or improved matching quality while substantially reducing end-to-end runtime relative to strong baselines. Our analysis further reveals that key configuration parameters introduce an inherent trade-off between performance and overhead, offering practical guidance for designing efficient and scalable RAG systems for entity matching and data integration.
LGOct 10, 2025Code
Large Language Model Prompt Datasets: An In-depth Analysis and InsightsYuanming Zhang, Yan Lin, Arijit Khan et al.
A prompt is a natural language instruction that defines a specific task for a large language model (LLM) and serves as the primary interface for human-LLM interaction. With the growing deployment of LLMs, diverse prompt datasets are emerging from platforms such as GitHub and social media. These datasets span a wide array of applications and content types, facilitating both broader LLM utilization and improved prompt engineering. In this work, we--for the first time--have compiled an extensive list of prompt datasets sourced from various channels, representing a spectrum of downstream tasks, languages, engineering techniques, attributes, and modalities. We select key representative datasets for systematic analysis, revealing commonalities and differences in prompt construction across categories, distinguishing them from other text corpora like literature and web. We further propose a prompt optimization approach that leverages syntactic embeddings of part-of-speech and dependency structures. By identifying a centroid representation of prompts and guiding LLMs to rewrite prompts toward this centroid, our method improves the meaningfulness of model outputs. We have made our datasets and code available.
LGJan 4, 2024
View-based Explanations for Graph Neural NetworksTingyang Chen, Dazhuo Qiu, Yinghui Wu et al.
Generating explanations for graph neural networks (GNNs) has been studied to understand their behavior in analytical tasks such as graph classification. Existing approaches aim to understand the overall results of GNNs rather than providing explanations for specific class labels of interest, and may return explanation structures that are hard to access, nor directly queryable.We propose GVEX, a novel paradigm that generates Graph Views for EXplanation. (1) We design a two-tier explanation structure called explanation views. An explanation view consists of a set of graph patterns and a set of induced explanation subgraphs. Given a database G of multiple graphs and a specific class label l assigned by a GNN-based classifier M, it concisely describes the fraction of G that best explains why l is assigned by M. (2) We propose quality measures and formulate an optimization problem to compute optimal explanation views for GNN explanation. We show that the problem is $Σ^2_P$-hard. (3) We present two algorithms. The first one follows an explain-and-summarize strategy that first generates high-quality explanation subgraphs which best explain GNNs in terms of feature influence maximization, and then performs a summarization step to generate patterns. We show that this strategy provides an approximation ratio of 1/2. Our second algorithm performs a single-pass to an input node stream in batches to incrementally maintain explanation views, having an anytime quality guarantee of 1/4 approximation. Using real-world benchmark data, we experimentally demonstrate the effectiveness, efficiency, and scalability of GVEX. Through case studies, we showcase the practical applications of GVEX.
CLDec 20, 2024
Logical Consistency of Large Language Models in Fact-checkingBishwamittra Ghosh, Sarah Hasan, Naheed Anjum Arafat et al.
In recent years, large language models (LLMs) have demonstrated significant success in performing varied natural language tasks such as language translation, question-answering, summarizing, fact-checking, etc. Despite LLMs' impressive ability to generate human-like texts, LLMs are infamous for their inconsistent responses - a meaning-preserving change in the input query results in an inconsistent response and attributes to vulnerabilities of LLMs such as hallucination. Consequently, existing research focuses on simple paraphrasing-based consistency assessment of LLMs, and ignores complex queries that necessitate an even better understanding of logical reasoning by an LLM. Our work therefore addresses the logical inconsistency of LLMs under complex logical queries with primitive logical operators, e.g., negation, conjunction, and disjunction. As a test bed, we consider retrieval-augmented LLMs on a fact-checking task involving propositional logic queries from knowledge graphs (KGs). Our contributions are threefold. Benchmark: We introduce three logical fact-checking datasets over KGs for community development towards logically consistent LLMs. Assessment: We propose consistency measures of LLMs on propositional logic queries and demonstrate that existing LLMs lack logical consistency, especially on complex queries. Improvement: We employ supervised fine-tuning to improve the logical consistency of LLMs on the complex fact-checking task with KG contexts. We have made our source code and benchmarks available.
CRApr 28, 2024
Machine Learning for Blockchain Data Analysis: Progress and OpportunitiesPoupak Azad, Cuneyt Gurcan Akcora, Arijit Khan
Blockchain technology has rapidly emerged to mainstream attention, while its publicly accessible, heterogeneous, massive-volume, and temporal data are reminiscent of the complex dynamics encountered during the last decade of big data. Unlike any prior data source, blockchain datasets encompass multiple layers of interactions across real-world entities, e.g., human users, autonomous programs, and smart contracts. Furthermore, blockchain's integration with cryptocurrencies has introduced financial aspects of unprecedented scale and complexity such as decentralized finance, stablecoins, non-fungible tokens, and central bank digital currencies. These unique characteristics present both opportunities and challenges for machine learning on blockchain data. On one hand, we examine the state-of-the-art solutions, applications, and future directions associated with leveraging machine learning for blockchain data analysis critical for the improvement of blockchain technology such as e-crime detection and trends prediction. On the other hand, we shed light on the pivotal role of blockchain by providing vast datasets and tools that can catalyze the growth of the evolving machine learning ecosystem. This paper serves as a comprehensive resource for researchers, practitioners, and policymakers, offering a roadmap for navigating this dynamic and transformative field.
DBJan 15, 2025
Knowledge Graph-based Retrieval-Augmented Generation for Schema MatchingChuangtao Ma, Sriom Chakrabarti, Arijit Khan et al.
Traditional similarity-based schema matching methods are incapable of resolving semantic ambiguities and conflicts in domain-specific complex mapping scenarios due to missing commonsense and domain-specific knowledge. The hallucination problem of large language models (LLMs) also makes it challenging for LLM-based schema matching to address the above issues. Therefore, we propose a Knowledge Graph-based Retrieval-Augmented Generation model for Schema Matching, referred to as the KG-RAG4SM. In particular, KG-RAG4SM introduces novel vector-based, graph traversal-based, and query-based graph retrievals, as well as a hybrid approach and ranking schemes that identify the most relevant subgraphs from external large knowledge graphs (KGs). We showcase that KG-based retrieval-augmented LLMs are capable of generating more accurate results for complex matching cases without any re-training. Our experimental results show that KG-RAG4SM outperforms the LLM-based state-of-the-art (SOTA) methods (e.g., Jellyfish-8B) by 35.89% and 30.50% in terms of precision and F1 score on the MIMIC dataset, respectively; KG-RAG4SM with GPT-4o-mini outperforms the pre-trained language model (PLM)-based SOTA methods (e.g., SMAT) by 69.20% and 21.97% in terms of precision and F1 score on the Synthea dataset, respectively. The results also demonstrate that our approach is more efficient in end-to-end schema matching, and scales to retrieve from large KGs. Our case studies on the dataset from the real-world schema matching scenario exhibit that the hallucination problem of LLMs for schema matching is well mitigated by our solution.
LGApr 30, 2024
Generating Robust Counterfactual Witnesses for Graph Neural NetworksDazhuo Qiu, Mengying Wang, Arijit Khan et al.
This paper introduces a new class of explanation structures, called robust counterfactual witnesses (RCWs), to provide robust, both counterfactual and factual explanations for graph neural networks. Given a graph neural network M, a robust counterfactual witness refers to the fraction of a graph G that are counterfactual and factual explanation of the results of M over G, but also remains so for any "disturbed" G by flipping up to k of its node pairs. We establish the hardness results, from tractable results to co-NP-hardness, for verifying and generating robust counterfactual witnesses. We study such structures for GNN-based node classification, and present efficient algorithms to verify and generate RCWs. We also provide a parallel algorithm to verify and generate RCWs for large graphs with scalability guarantees. We experimentally verify our explanation generation process for benchmark datasets, and showcase their applications.
LGNov 3, 2024
Enhancing Glucose Level Prediction of ICU Patients through Hierarchical Modeling of Irregular Time-SeriesHadi Mehdizavareh, Arijit Khan, Simon Lebech Cichosz
Accurately predicting blood glucose (BG) levels of ICU patients is critical, as both hypoglycemia (BG < 70 mg/dL) and hyperglycemia (BG > 180 mg/dL) are associated with increased morbidity and mortality. This study presents a proof-of-concept machine learning framework, the Multi-source Irregular Time-Series Transformer (MITST), designed to predict BG levels in ICU patients. In contrast to existing methods that rely heavily on manual feature engineering or utilize limited Electronic Health Record (EHR) data sources, MITST integrates diverse clinical data--including laboratory results, medications, and vital signs without predefined aggregation. The model leverages a hierarchical Transformer architecture, designed to capture interactions among features within individual timestamps, temporal dependencies across different timestamps, and semantic relationships across multiple data sources. Evaluated using the extensive eICU database (200,859 ICU stays across 208 hospitals), MITST achieves a statistically significant ( p < 0.001 ) average improvement of 1.7 percentage points (pp) in AUROC and 1.8 pp in AUPRC over a state-of-the-art random forest baseline. Crucially, for hypoglycemia--a rare but life-threatening condition--MITST increases sensitivity by 7.2 pp, potentially enabling hundreds of earlier interventions across ICU populations. The flexible architecture of MITST allows seamless integration of new data sources without retraining the entire model, enhancing its adaptability for clinical decision support. While this study focuses on predicting BG levels, we also demonstrate MITST's ability to generalize to a distinct clinical task (in-hospital mortality prediction), highlighting its potential for broader applicability in ICU settings. MITST thus offers a robust and extensible solution for analyzing complex, multi-source, irregular time-series data.
73.8DBApr 3
LLM+Graph@VLDB'2025 Workshop SummaryYixiang Fang, Arijit Khan, Tianxing Wu et al.
The integration of large language models (LLMs) with graph-structured data has become a pivotal and fast evolving research frontier, drawing strong interest from both academia and industry. The 2nd LLM+Graph Workshop, co-located with the 51st International Conference on Very Large Data Bases (VLDB 2025) in London, focused on advancing algorithms and systems that bridge LLMs, graph data management, and graph machine learning for practical applications. This report highlights the key research directions, challenges, and innovative solutions presented by the workshop's speakers.
LGAug 5, 2025
NAEx: A Plug-and-Play Framework for Explaining Network AlignmentShruti Saxena, Arijit Khan, Joydeep Chandra
Network alignment (NA) identifies corresponding nodes across multiple networks, with applications in domains like social networks, co-authorship, and biology. Despite advances in alignment models, their interpretability remains limited, making it difficult to understand alignment decisions and posing challenges in building trust, particularly in high-stakes domains. To address this, we introduce NAEx, a plug-and-play, model-agnostic framework that explains alignment models by identifying key subgraphs and features influencing predictions. NAEx addresses the key challenge of preserving the joint cross-network dependencies on alignment decisions by: (1) jointly parameterizing graph structures and feature spaces through learnable edge and feature masks, and (2) introducing an optimization objective that ensures explanations are both faithful to the original predictions and enable meaningful comparisons of structural and feature-based similarities between networks. NAEx is an inductive framework that efficiently generates NA explanations for previously unseen data. We introduce evaluation metrics tailored to alignment explainability and demonstrate NAEx's effectiveness and efficiency on benchmark datasets by integrating it with four representative NA models.
LGJun 22, 2025
SliceGX: Layer-wise GNN Explanation with Model-slicingTingting Zhu, Tingyang Chen, Yinghui Wu et al.
Ensuring the trustworthiness of graph neural networks (GNNs) as black-box models requires effective explanation methods. Existing GNN explanations typically apply input perturbations to identify subgraphs that are responsible for the occurrence of the final output of GNNs. However, such approaches lack finer-grained, layer-wise analysis of how intermediate representations contribute to the final result, capabilities that are crucial for model diagnosis and architecture optimization. This paper introduces SliceGX, a novel GNN explanation approach that generates explanations at specific GNN layers in a progressive manner. Given a GNN M, a set of selected intermediate layers, and a target layer, SliceGX automatically segments M into layer blocks ("model slice") and discovers high-quality explanatory subgraphs in each layer block that clarifies the occurrence of output of M at the targeted layer. Although finding such layer-wise explanations is computationally challenging, we develop efficient algorithms and optimization techniques that incrementally generate and maintain these subgraphs with provable approximation guarantees. Additionally, SliceGX offers a SPARQL-like query interface, providing declarative access and search capacities for the generated explanations. Through experiments on large real-world graphs and representative GNN architectures, we verify the effectiveness and efficiency of SliceGX, and illustrate its practical utility in supporting model debugging.
LGMay 12, 2025
Interpreting Graph Inference with Skyline ExplanationsDazhuo Qiu, Haolai Che, Arijit Khan et al.
Inference queries have been routinely issued to graph machine learning models such as graph neural networks (GNNs) for various network analytical tasks. Nevertheless, GNN outputs are often hard to interpret comprehensively. Existing methods typically conform to individual pre-defined explainability measures (such as fidelity), which often leads to biased, ``one-side'' interpretations. This paper introduces skyline explanation, a new paradigm that interprets GNN outputs by simultaneously optimizing multiple explainability measures of users' interests. (1) We propose skyline explanations as a Pareto set of explanatory subgraphs that dominate others over multiple explanatory measures. We formulate skyline explanation as a multi-criteria optimization problem, and establish its hardness results. (2) We design efficient algorithms with an onion-peeling approach, which strategically prioritizes nodes and removes unpromising edges to incrementally assemble skyline explanations. (3) We also develop an algorithm to diversify the skyline explanations to enrich the comprehensive interpretation. (4) We introduce efficient parallel algorithms with load-balancing strategies to scale skyline explanation for large-scale GNN-based inference. Using real-world and synthetic graphs, we experimentally verify our algorithms' effectiveness and scalability.
DBMay 23, 2023
Knowledge Graphs QueryingArijit Khan
Knowledge graphs (KGs) such as DBpedia, Freebase, YAGO, Wikidata, and NELL were constructed to store large-scale, real-world facts as (subject, predicate, object) triples -- that can also be modeled as a graph, where a node (a subject or an object) represents an entity with attributes, and a directed edge (a predicate) is a relationship between two entities. Querying KGs is critical in web search, question answering (QA), semantic search, personal assistants, fact checking, and recommendation. While significant progress has been made on KG construction and curation, thanks to deep learning recently we have seen a surge of research on KG querying and QA. The objectives of our survey are two-fold. First, research on KG querying has been conducted by several communities, such as databases, data mining, semantic web, machine learning, information retrieval, and natural language processing (NLP), with different focus and terminologies; and also in diverse topics ranging from graph databases, query languages, join algorithms, graph patterns matching, to more sophisticated KG embedding and natural language questions (NLQs). We aim at uniting different interdisciplinary topics and concepts that have been developed for KG querying. Second, many recent advances on KG and query embedding, multimodal KG, and KG-QA come from deep learning, IR, NLP, and computer vision domains. We identify important challenges of KG querying that received less attention by graph databases, and by the DB community in general, e.g., incomplete KG, semantic matching, multimodal data, and NLQs. We conclude by discussing interesting opportunities for the data management community, for instance, KG as a unified data model and vector-based query processing.
DBOct 15, 2019
Efficiently Embedding Dynamic Knowledge GraphsTianxing Wu, Arijit Khan, Melvin Yong et al.
Knowledge graph (KG) embedding encodes the entities and relations from a KG into low-dimensional vector spaces to support various applications such as KG completion, question answering, and recommender systems. In real world, knowledge graphs (KGs) are dynamic and evolve over time with addition or deletion of triples. However, most existing models focus on embedding static KGs while neglecting dynamics. To adapt to the changes in a KG, these models need to be retrained on the whole KG with a high time cost. In this paper, to tackle the aforementioned problem, we propose a new context-aware Dynamic Knowledge Graph Embedding (DKGE) method which supports the embedding learning in an online fashion. DKGE introduces two different representations (i.e., knowledge embedding and contextual element embedding) for each entity and each relation, in the joint modeling of entities and relations as well as their contexts, by employing two attentive graph convolutional networks, a gate strategy, and translation operations. This effectively helps limit the impacts of a KG update in certain regions, not in the entire graph, so that DKGE can rapidly acquire the updated KG embedding by a proposed online learning algorithm. Furthermore, DKGE can also learn KG embedding from scratch. Experiments on the tasks of link prediction and question answering in a dynamic environment demonstrate the effectiveness and efficiency of DKGE.