Shichao Pei

LG
h-index: 13
18 papers
1,177 citations
Novelty: 48%
AI Score: 56

18 Papers

LG · May 2, 2022
Positive-Unlabeled Learning with Adversarial Data Augmentation for Knowledge Graph Completion

Zhenwei Tang, Shichao Pei, Zhao Zhang et al. · utoronto

Most real-world knowledge graphs (KGs) are far from complete and comprehensive. This problem has motivated efforts to predict the most plausible missing facts to complete a given KG, i.e., knowledge graph completion (KGC). However, existing KGC methods suffer from two main issues: 1) the false negative issue, i.e., the sampled negative training instances may include potential true facts; and 2) the data sparsity issue, i.e., true facts account for only a tiny part of all possible facts. To this end, we propose positive-unlabeled learning with adversarial data augmentation (PUDA) for KGC. In particular, PUDA tailors a positive-unlabeled risk estimator to the KGC task to deal with the false negative issue. Furthermore, to address the data sparsity issue, PUDA achieves data augmentation by unifying adversarial training and positive-unlabeled learning under a positive-unlabeled minimax game. Extensive experimental results on real-world benchmark datasets demonstrate the effectiveness and compatibility of our proposed method.
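To make the positive-unlabeled idea in the abstract above concrete, here is a minimal sketch of a generic non-negative PU risk estimate. It is not PUDA's tailored estimator: the sigmoid surrogate loss, the function names, and the `prior` parameter (the assumed fraction of true facts among unlabeled triples) are all illustrative assumptions.

```python
import math


def sigmoid_loss(score: float, label: int) -> float:
    """Sigmoid surrogate loss: near 0 when the sign of score agrees with label (+1 or -1)."""
    return 1.0 / (1.0 + math.exp(label * score))


def nn_pu_risk(pos_scores, unl_scores, prior):
    """Generic non-negative PU risk (illustrative, not the paper's estimator).

    pos_scores: model scores for observed (positive) triples.
    unl_scores: model scores for sampled (unlabeled) triples.
    prior: assumed probability that an unlabeled triple is actually true.
    """
    risk_pos_as_pos = sum(sigmoid_loss(s, +1) for s in pos_scores) / len(pos_scores)
    risk_pos_as_neg = sum(sigmoid_loss(s, -1) for s in pos_scores) / len(pos_scores)
    risk_unl_as_neg = sum(sigmoid_loss(s, -1) for s in unl_scores) / len(unl_scores)
    # Unlabeled data is a mixture of positives and negatives, so the share
    # contributed by hidden positives is subtracted from the negative risk.
    negative_part = risk_unl_as_neg - prior * risk_pos_as_neg
    # Clamping at zero keeps the estimate from going negative.
    return prior * risk_pos_as_pos + max(0.0, negative_part)
```

The subtraction is what addresses false negatives: unlabeled triples are not treated as certainly false, only as a mixture whose positive share is discounted.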

LG · Feb 1, 2023
Knowledge Distillation on Graphs: A Survey

Yijun Tian, Shichao Pei, Xiangliang Zhang et al.

Graph Neural Networks (GNNs) have attracted tremendous attention by demonstrating their capability to handle graph data. However, they are difficult to deploy on resource-limited devices due to model sizes and scalability constraints imposed by the multi-hop data dependency. In addition, real-world graphs usually possess complex structural information and features. Therefore, to improve the applicability of GNNs and fully encode the complicated topological information, knowledge distillation on graphs (KDG) has been introduced to build a smaller yet effective model and exploit more knowledge from data, leading to model compression and performance improvement. Recently, KDG has achieved considerable progress, with many studies proposed. In this survey, we systematically review these works. Specifically, we first introduce KDG challenges and bases, then categorize and summarize existing KDG works by answering the following three questions: 1) what to distill, 2) who to whom, and 3) how to distill. Finally, we share our thoughts on future research directions.

AI · Apr 23, 2023
LogicRec: Recommendation with Users' Logical Requirements

Zhenwei Tang, Griffin Floto, Armin Toroghi et al. · utoronto

Users may demand recommendations with highly personalized requirements involving logical operations, e.g., the intersection of two requirements, where such requirements naturally form structured logical queries on knowledge graphs (KGs). To date, existing recommender systems lack the capability to tackle users' complex logical requirements. In this work, we formulate the problem of recommendation with users' logical requirements (LogicRec) and construct benchmark datasets for LogicRec. Furthermore, we propose an initial solution for LogicRec based on logical requirement retrieval and user preference retrieval, where we face two challenges. First, KGs are incomplete in nature. Therefore, there are always missing true facts, which means that the answers to logical requirements cannot be completely found in KGs. In this case, item selection based on the answers to logical queries is not applicable. We thus resort to logical query embedding (LQE) to jointly infer missing facts and retrieve items based on logical requirements. Second, answer sets are under-exploited. Existing LQE methods can only deal with query-answer pairs, where queries in our case are the intersected user preferences and logical requirements. However, the logical requirements and user preferences have different answer sets, offering us richer knowledge about the requirements and preferences by providing requirement-item and preference-item pairs. Thus, we design a multi-task knowledge-sharing mechanism to exploit these answer sets collectively. Extensive experimental results demonstrate the significance of the LogicRec task and the effectiveness of our proposed method.

AI · May 29, 2022
TAR: Neural Logical Reasoning across TBox and ABox

Zhenwei Tang, Shichao Pei, Xi Peng et al. · utoronto

Many ontologies, i.e., Description Logic (DL) knowledge bases, have been developed to provide rich knowledge about various domains. An ontology consists of an ABox, i.e., assertion axioms between two entities or between a concept and an entity, and a TBox, i.e., terminology axioms between two concepts. Neural logical reasoning (NLR) is a fundamental task for exploring such knowledge bases; it aims at answering multi-hop queries with logical operations based on distributed representations of queries and answers. While previous NLR methods can give specific entity-level answers, i.e., ABox answers, they are not able to provide descriptive concept-level answers, i.e., TBox answers, where each concept is a description of a set of entities. In other words, previous NLR methods only reason over the ABox of an ontology while ignoring the TBox. In particular, providing TBox answers enables inferring explanations of each query with descriptive concepts, which makes answers comprehensible to users and is highly useful in the field of applied ontology. In this work, we formulate the problem of neural logical reasoning across TBox and ABox (TA-NLR), which requires addressing challenges in incorporating, representing, and operating on concepts. We propose an original solution named TAR for TA-NLR. First, we incorporate description-logic-based ontological axioms to provide the source of concepts. Then, we represent concepts and queries as fuzzy sets, i.e., sets whose elements have degrees of membership, to bridge concepts and queries with entities. Moreover, we design operators involving concepts on top of the fuzzy set representation of concepts and queries for optimization and inference. Extensive experimental results on two real-world datasets demonstrate the effectiveness of TAR for TA-NLR.
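The fuzzy-set representation mentioned in the abstract above can be illustrated with a small sketch. It uses the classical min/max fuzzy operators over dictionaries mapping entity names to membership degrees; these operators, the function names, and the example entities are assumptions for illustration, and the operators TAR actually designs may differ.

```python
def fuzzy_intersection(a: dict, b: dict) -> dict:
    """Membership in the intersection is the minimum of the two memberships."""
    return {e: min(a.get(e, 0.0), b.get(e, 0.0)) for e in set(a) | set(b)}


def fuzzy_union(a: dict, b: dict) -> dict:
    """Membership in the union is the maximum of the two memberships."""
    return {e: max(a.get(e, 0.0), b.get(e, 0.0)) for e in set(a) | set(b)}


def fuzzy_complement(a: dict, universe) -> dict:
    """Membership in the complement is one minus the original membership."""
    return {e: 1.0 - a.get(e, 0.0) for e in universe}


# Hypothetical concept and query, each a fuzzy set over entities.
concept = {"aspirin": 0.8, "ibuprofen": 0.3}
query = {"aspirin": 0.5, "codeine": 0.9}
both = fuzzy_intersection(concept, query)
either = fuzzy_union(concept, query)
```

Because concepts and queries are both fuzzy sets over the same entities, a concept-level answer can be compared with a query by how much their membership functions overlap.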

LG · Oct 7, 2023
ReactionTeam: Teaming Experts for Divergent Thinking Beyond Typical Reaction Patterns

Taicheng Guo, Changsheng Ma, Xiuying Chen et al.

Reaction prediction, a critical task in synthetic chemistry, is to predict the outcome of a reaction based on given reactants. Generative models such as the Transformer have typically been employed to predict the reaction product. However, these likelihood-maximization models overlook the inherent stochastic nature of chemical reactions, such as the multiple ways electrons can be redistributed among atoms during the reaction process. In scenarios where similar reactants could follow different electron redistribution patterns, these models typically predict the most common outcomes, neglecting less frequent but potentially crucial reaction patterns. These overlooked patterns, though rare, can lead to innovative methods for designing synthetic routes and significantly advance synthesis techniques. To address these limitations, we build a team of expert models to capture diverse plausible reaction outcomes for the same reactants, mimicking the divergent thinking of chemists. The proposed framework, ReactionTeam, is composed of specialized expert models, each trained to capture a distinct type of electron redistribution pattern in reactions, and a ranking expert that evaluates and orders the generated predictions. Experimental results across two widely used datasets and different data settings demonstrate that our proposed method achieves significantly better performance than existing state-of-the-art approaches.

78.3 · CV · Apr 25 · Code
PushupBench: Your VLM is not good at counting pushups

Shengzhi Li, Jiarun Chen, Karun Sharma et al.

Large vision-language models (VLMs) can recognize what happens in video but fail to count how many times. We introduce PushupBench, 446 long-form clips (avg. 36.7s) for evaluating repetition counting. The best frontier model achieves 42.1% exact accuracy; open-source 4B models score ~6%, matching supervised baselines. We show that accuracy alone misleads: weaker models exploit the modal count rather than reason temporally. Fine-tuning on counting with 1k samples transfers to general video understanding: MVBench (+2.15), PerceptionTest (+1.88), TVBench (+4.54), suggesting counting is a proxy for broader temporal reasoning. PushupBench is incorporated in lmms-eval (https://github.com/EvolvingLMMs-Lab/lmms-eval/pull/1262) and hosted online (pushupbench.com/).

48.0 · CL · May 18
Predictive Prefetching for Retrieval-Augmented Generation

Wuyang Zhang, Shichao Pei

Retrieval-Augmented Generation (RAG) improves factual grounding in large language models but suffers from substantial latency due to synchronous retrieval. While recent work explores asynchronous retrieval, existing approaches rely on heuristic coordination between retrieval and generation and assume stable information demands during decoding, an assumption that often breaks in complex, multi-domain settings. In this paper, we propose an advanced asynchronous retrieval framework that enables predictive prefetching aligned with evolving information needs. The framework explicitly predicts when retrieval should be triggered and what information should be retrieved using three components (a retrieval predictor, a context monitor, and a query generator) by exploiting semantic precursors in generation dynamics that emerge several tokens before uncertainty becomes critical. Experiments on multiple benchmarks demonstrate up to 43.5% end-to-end latency reduction and 62.4% improvement in time-to-first-token, while maintaining answer quality comparable to synchronous RAG baselines.

63.6 · CR · Apr 22
Behavioral Consistency and Transparency Analysis on Large Language Model API Gateways

Guanjie Lin, Yinxin Wan, Shichao Pei et al.

Third-party Large Language Model (LLM) API gateways are rapidly emerging as unified access points to models offered by multiple vendors. However, the internal routing, caching, and billing policies of these gateways are largely undisclosed, leaving users with limited visibility into whether requests are served by the advertised models, whether responses remain faithful to upstream APIs, or whether invoices accurately reflect public pricing policies. To address this gap, we introduce GateScope, a lightweight black-box measurement framework for evaluating behavioral consistency and operational transparency in commercial LLM gateways. GateScope is designed to detect key misbehaviors, including model downgrading or switching, silent truncation, billing inaccuracies, and instability in latency by auditing gateways along four critical dimensions: response content analysis, multi-turn conversation performance, billing accuracy, and latency characteristics. Our measurements across 10 real-world commercial LLM API gateways reveal frequent gaps between expected and actual behaviors, including silent model substitutions, degraded memory retention, deviations from announced pricing, and substantial variation in latency stability across platforms.

CL · Jan 21, 2024 · Code
Large Language Model based Multi-Agents: A Survey of Progress and Challenges

Taicheng Guo, Xiuying Chen, Yaqi Wang et al.

Large Language Models (LLMs) have achieved remarkable success across a wide array of tasks. Due to the impressive planning and reasoning abilities of LLMs, they have been used as autonomous agents to do many tasks automatically. Recently, based on the development of using one LLM as a single planning or decision-making agent, LLM-based multi-agent systems have achieved considerable progress in complex problem-solving and world simulation. To provide the community with an overview of this dynamic field, we present this survey to offer an in-depth discussion on the essential aspects of multi-agent systems based on LLMs, as well as the challenges. Our goal is for readers to gain substantial insights on the following questions: What domains and environments do LLM-based multi-agents simulate? How are these agents profiled and how do they communicate? What mechanisms contribute to the growth of agents' capacities? For those interested in delving into this field of study, we also summarize the commonly used datasets or benchmarks for them to have convenient access. To keep researchers updated on the latest studies, we maintain an open-source GitHub repository, dedicated to outlining the research on LLM-based multi-agent systems.

93.7 · CR · Apr 7
Your LLM Agent Can Leak Your Data: Data Exfiltration via Backdoored Tool Use

Wuyang Zhang, Shichao Pei

Tool-use large language model (LLM) agents are increasingly deployed to support sensitive workflows, relying on tool calls for retrieval, external API access, and session memory management. While prior research has examined various threats, the risk of systematic data exfiltration by backdoored agents remains underexplored. In this work, we present Back-Reveal, a data exfiltration attack that embeds semantic triggers into fine-tuned LLM agents. When triggered, the backdoored agent invokes memory-access tool calls to retrieve stored user context and exfiltrates it via disguised retrieval tool calls. We further demonstrate that multi-turn interaction amplifies the impact of data exfiltration, as attacker-controlled retrieval responses can subtly steer subsequent agent behavior and user interactions, enabling sustained and cumulative information leakage over time. Our experimental results expose a critical vulnerability in LLM agents with tool access and highlight the need for defenses against exfiltration-oriented backdoors.

CL · Feb 16, 2024
Multi-modal Preference Alignment Remedies Degradation of Visual Instruction Tuning on Language Models

Shengzhi Li, Rongyu Lin, Shichao Pei

Multi-modal large language models (MLLMs) are expected to support multi-turn queries that interleave image and text modalities in production. However, current MLLMs trained with visual-question-answering (VQA) datasets can suffer from degradation, as VQA datasets lack the diversity and complexity of the original text instruction datasets with which the underlying language model was trained. To address this degradation, we first collect a lightweight, 5k-sample VQA preference dataset in which answers were annotated by Gemini on five quality metrics in a granular fashion, and investigate standard Supervised Fine-tuning, rejection sampling, Direct Preference Optimization (DPO), and SteerLM algorithms. Our findings indicate that with DPO, we can surpass the instruction-following capabilities of the language model, achieving a 6.73 score on MT-Bench, compared to Vicuna's 6.57 and LLaVA's 5.99. This enhancement in textual instruction-following capability correlates with boosted visual instruction performance (+4.9% on MM-Vet, +6% on LLaVA-Bench), with minimal alignment tax on visual knowledge benchmarks compared to the previous RLHF approach. In conclusion, we propose a distillation-based multi-modal alignment model with fine-grained annotations on a small dataset that restores and boosts the MLLM's language capability after visual instruction tuning.
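For readers unfamiliar with Direct Preference Optimization, the following is a minimal sketch of the standard DPO loss for a single preference pair, written in plain Python. The function name, argument names, and the default `beta` are illustrative assumptions, and this is not the training code used in the paper above.

```python
import math


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) answer pair.

    Each argument is the summed log-probability of an answer under the
    trainable policy or the frozen reference model. beta (assumed value)
    controls how strongly the policy is pushed away from the reference.
    """
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    margin = chosen_gain - rejected_gain
    # -log(sigmoid(beta * margin)) written as log(1 + exp(-beta * margin)).
    return math.log1p(math.exp(-beta * margin))
```

When the policy equals the reference, the margin is zero and the loss is log 2; the loss falls as the policy raises the chosen answer's likelihood relative to the rejected one.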

LG · Apr 9, 2024
Zero-Shot Relational Learning for Multimodal Knowledge Graphs

Rui Cai, Shichao Pei, Xiangliang Zhang

Relational learning is an essential task in the domain of knowledge representation, particularly in knowledge graph completion (KGC). While relational learning in traditional single-modal settings has been extensively studied, exploring it within a multimodal KGC context presents distinct challenges and opportunities. One of the major challenges is inference on newly discovered relations without any associated training data. This zero-shot relational learning scenario poses unique requirements for multimodal KGC, i.e., utilizing multimodality to facilitate relational learning. However, existing works fail to leverage multimodal information and leave the problem unexplored. In this paper, we propose a novel end-to-end framework consisting of three components, i.e., a multimodal learner, a structure consolidator, and a relation embedding generator, to integrate diverse multimodal information and knowledge graph structures to facilitate zero-shot relational learning. Evaluation results on three multimodal knowledge graphs demonstrate the superior performance of our proposed method.

CR · Feb 9, 2025
Injecting Universal Jailbreak Backdoors into LLMs in Minutes

Zhuowei Chen, Qiannan Zhang, Shichao Pei

Jailbreak backdoor attacks on LLMs have garnered attention for their effectiveness and stealth. However, existing methods rely on crafting poisoned datasets and on the time-consuming process of fine-tuning. In this work, we propose JailbreakEdit, a novel jailbreak backdoor injection method that exploits model editing techniques to inject a universal jailbreak backdoor into safety-aligned LLMs with minimal intervention in minutes. JailbreakEdit integrates a multi-node target estimation to estimate the jailbreak space, thus creating shortcuts from the backdoor to this estimated jailbreak space that induce jailbreak actions. Our attack effectively shifts the models' attention by attaching strong semantics to the backdoor, enabling it to bypass internal safety mechanisms. Experimental results show that JailbreakEdit achieves a high jailbreak success rate on jailbreak prompts while preserving generation quality and safe behavior on normal queries. Our findings underscore the effectiveness, stealthiness, and explainability of JailbreakEdit, emphasizing the need for more advanced defense mechanisms in LLMs.

CL · Nov 7, 2024
Abstract2Appendix: Academic Reviews Enhance LLM Long-Context Capabilities

Shengzhi Li, Kittipat Kampa, Rongyu Lin et al.

Large language models (LLMs) have shown remarkable performance across various tasks, yet their ability to handle long-context reading remains challenging. This study explores the effectiveness of leveraging high-quality academic peer review data for fine-tuning LLMs to enhance their long-context capabilities. We compare the Direct Preference Optimization (DPO) method with the Supervised Fine-Tuning (SFT) method, demonstrating DPO's superiority and data efficiency. Our experiments show that the fine-tuned model achieves a 4.04-point improvement over phi-3 and a 2.6% increase on the Qasper benchmark using only 2000 samples. Despite facing limitations in data scale and processing costs, this study underscores the potential of DPO and high-quality data in advancing LLM performance. Additionally, the zero-shot benchmark results indicate that aggregated high-quality human reviews are overwhelmingly preferred over LLM-generated responses, even for the most capable models like GPT-4o. This suggests that high-quality human reviews are extremely rich in information, reasoning, and long-context retrieval, capabilities that even the most advanced models have not fully captured. These findings highlight the high utility of leveraging human reviews to further advance the field.

AI · Oct 9, 2021
A Generic Knowledge Based Medical Diagnosis Expert System

Xin Huang, Xuejiao Tang, Wenbin Zhang et al.

In this paper, we design and implement a generic medical knowledge-based system (MKBS) for identifying diseases from several symptoms. The system addresses important aspects such as the knowledge base, knowledge representation, and the inference engine. The system asks users a series of questions, and the inference engine uses certainty factors to prune unlikely solutions. The proposed disease diagnosis system also provides a graphical user interface (GUI) through which users interact with the expert system. Our expert system is generic and flexible, and can be integrated with any rule-based system for disease diagnosis.
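As a rough illustration of certainty-factor pruning as described in the abstract above, the sketch below combines supporting evidence with the classic MYCIN-style rule for positive certainty factors and drops hypotheses below a threshold. The combination rule, the threshold value, and the disease names are assumptions for illustration; the abstract does not specify how the system computes or prunes certainty factors.

```python
def combine_positive_cf(cf1: float, cf2: float) -> float:
    """MYCIN-style combination of two positive certainty factors in [0, 1]."""
    return cf1 + cf2 * (1.0 - cf1)


def prune_hypotheses(evidence: dict, threshold: float = 0.5) -> dict:
    """Combine the certainty factors per disease and keep those at or above threshold.

    evidence maps a disease name to the certainty factors contributed by
    the rules that fired for it. The threshold is a hypothetical setting.
    """
    kept = {}
    for disease, factors in evidence.items():
        combined = 0.0
        for cf in factors:
            combined = combine_positive_cf(combined, cf)
        if combined >= threshold:
            kept[disease] = combined
    return kept


# Hypothetical rule outputs after a few answered questions.
candidates = prune_hypotheses({"flu": [0.6, 0.5], "measles": [0.2, 0.1]})
```

In this example "flu" is kept with a combined factor of about 0.8, while "measles" (about 0.28) is pruned.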

IR · Jun 7, 2021
Scientific Dataset Discovery via Topic-level Recommendation

Basmah Altaf, Shichao Pei, Xiangliang Zhang

Data-intensive research requires the support of appropriate datasets. However, it is often time-consuming to discover usable datasets matching a specific research topic. We formulate the dataset discovery problem on an attributed heterogeneous graph, which is composed of paper-paper citations, paper-dataset citations, and paper content. We propose to characterize both paper and dataset nodes by their commonly shared latent topics, rather than learning user and item representations via canonical graph embedding models, because the usage of datasets and the themes of research projects can be understood on the common basis of research topics. The datasets relevant to a given research project can then be inferred in the shared topic space. The experimental results show that our model can generate reasonable profiles for datasets and recommend proper datasets for a query, which represents a research project linked with several papers.

LG · Sep 2, 2020
SAIL: Self-Augmented Graph Contrastive Learning

Lu Yu, Shichao Pei, Lizhong Ding et al.

This paper studies learning node representations with graph neural networks (GNNs) in the unsupervised scenario. Specifically, we derive a theoretical analysis and provide an empirical demonstration of the unstable performance of GNNs over different graph datasets when the supervision signals are not appropriately defined. The performance of GNNs depends on both the node feature smoothness and the locality of the graph structure. To smooth the discrepancy between node proximity measured by graph topology and by node features, we propose SAIL, a novel Self-Augmented graph contrastIve Learning framework with two complementary self-distilling regularization modules, i.e., intra- and inter-graph knowledge distillation. We demonstrate the competitive performance of SAIL on a variety of graph applications. Even with a single GNN layer, SAIL has consistently competitive or even better performance on various benchmark datasets compared with state-of-the-art baselines.

IR · May 19, 2020
Addressing Class-Imbalance Problem in Personalized Ranking

Lu Yu, Shichao Pei, Chuxu Zhang et al.

Pairwise ranking models have been widely used to address recommendation problems. The basic idea is to learn the rank of users' preferred items by separating items into positive samples if user-item interactions exist, and negative samples otherwise. Due to the limited number of observable interactions, pairwise ranking models face serious class-imbalance issues. Our theoretical analysis shows that current sampling-based methods cause a vertex-level imbalance problem, which makes the norm of learned item embeddings tend toward infinity after a certain number of training iterations, and consequently results in vanishing gradients and affects the model inference results. We thus propose an efficient Vital Negative Sampler (VINS) to alleviate the class-imbalance issue for pairwise ranking models, in particular for deep learning models optimized by gradient methods. The core of VINS is a biased sampler with a reject probability that tends to accept a negative candidate with a larger degree weight than the given positive item. Evaluation results on several real datasets demonstrate that the proposed sampling method speeds up the training procedure by 30% to 50% for ranking models ranging from shallow to deep, while maintaining and even improving the quality of ranking results in top-N item recommendation.
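The degree-biased rejection idea in the abstract above can be sketched as follows. This is only an illustration: the acceptance probability min(1, deg(candidate) / deg(positive)), the `max_tries` cap, and all names are assumptions, and the actual VINS reject probability is defined in the paper, not here.

```python
import random


def sample_negative(positive_item, candidates, degree, rng, max_tries=100):
    """Draw a negative item, preferring candidates with a higher degree than the positive.

    degree maps each item to its interaction count; the positive item is
    assumed to have degree >= 1. The acceptance rule is a hypothetical one.
    """
    pos_degree = degree[positive_item]
    choice = None
    for _ in range(max_tries):
        choice = rng.choice(candidates)
        # Candidates at least as popular as the positive are always accepted;
        # less popular ones are accepted in proportion to their degree.
        accept_prob = min(1.0, degree[choice] / pos_degree)
        if rng.random() < accept_prob:
            return choice
    return choice  # Fall back to the last proposal if nothing was accepted.


# Hypothetical item degrees: "a" is popular, "b" is rare.
item_degree = {"p": 10, "a": 100, "b": 1}
rng = random.Random(0)
draws = [sample_negative("p", ["a", "b"], item_degree, rng) for _ in range(2000)]
```

Under this rule the popular item "a" is drawn far more often than "b", which matches the abstract's description of favoring negatives with a larger degree weight than the positive item.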