Lijing Zhu

LG
h-index 8
6 papers
32 citations
Novelty 58%
AI Score 39

6 Papers

CV · Mar 7, 2023
TMHOI: Translational Model for Human-Object Interaction Detection

Lijing Zhu, Qizhen Lan, Alvaro Velasquez et al.

Detecting human-object interactions (HOIs) is an intricate challenge in computer vision. Existing HOI detection methods rely heavily on appearance-based features, which may not capture all the characteristics needed for accurate detection. To overcome this limitation, we propose an innovative graph-based approach called TMGHOI (Translational Model for Human-Object Interaction Detection). Our method captures the sentiment representation of HOIs by integrating both spatial and semantic knowledge. HOIs are represented as a graph in which the interaction components serve as nodes and their spatial relationships serve as edges. To extract crucial spatial and semantic information, TMGHOI employs separate spatial and semantic encoders. These encodings are then combined to construct a knowledge graph that captures the sentiment representation of HOIs. In addition, the ability to incorporate prior knowledge improves the understanding of interactions, further boosting detection accuracy. We conducted extensive evaluations on the widely used HICO-DET dataset to demonstrate the effectiveness of TMGHOI. Our approach outperformed existing state-of-the-art graph-based methods by a significant margin, showing its potential as a superior solution for HOI detection. We are confident that TMGHOI can significantly improve the accuracy and efficiency of HOI detection. Its integration of spatial and semantic knowledge, together with its computational efficiency and practicality, makes it a valuable tool for researchers and practitioners in the computer vision community. As with any research, further exploration and evaluation on additional datasets are needed to establish the generalizability and robustness of the proposed method.
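The abstract does not give TMGHOI's scoring function. Assuming a TransE-style translation, in which a human embedding plus an interaction embedding should land near the object embedding, the sketch below ranks candidate interactions for one human-object pair and computes one hand-crafted spatial edge feature from two bounding boxes. The function names, the (x1, y1, x2, y2) box format, and the choice of spatial feature are illustrative assumptions, not the authors' implementation.

```python
import math


def spatial_edge_feature(human_box, object_box):
    """Hypothetical spatial encoding for the human -> object edge.

    Boxes are (x1, y1, x2, y2). Returns the object-center offset normalized
    by the human box size, plus the log ratio of the two box areas.
    """
    hx1, hy1, hx2, hy2 = human_box
    ox1, oy1, ox2, oy2 = object_box
    hw, hh = hx2 - hx1, hy2 - hy1
    ow, oh = ox2 - ox1, oy2 - oy1
    hcx, hcy = hx1 + hw / 2, hy1 + hh / 2
    ocx, ocy = ox1 + ow / 2, oy1 + oh / 2
    return [(ocx - hcx) / hw, (ocy - hcy) / hh, math.log((ow * oh) / (hw * hh))]


def translational_score(h, r, o):
    """TransE-style distance ||h + r - o||; smaller means more plausible."""
    return math.sqrt(sum((hi + ri - oi) ** 2 for hi, ri, oi in zip(h, r, o)))


def rank_interactions(h, o, relations):
    """Sort interaction labels from most to least plausible for (h, o)."""
    return sorted(relations, key=lambda name: translational_score(h, relations[name], o))


if __name__ == "__main__":
    human, obj = [0.1, 0.2], [1.1, 0.2]                   # toy 2-D embeddings
    relations = {"ride": [1.0, 0.0], "hold": [0.0, 1.0]}  # toy interaction vectors
    print(rank_interactions(human, obj, relations))       # prints ['ride', 'hold']
```

In a full model the embeddings would come from the spatial and semantic encoders; here they are fixed toy vectors so the ranking can be checked by hand.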

CL · Jun 9, 2025 · Code
ETT-CKGE: Efficient Task-driven Tokens for Continual Knowledge Graph Embedding

Lijing Zhu, Qizhen Lan, Qing Tian et al.

Continual Knowledge Graph Embedding (CKGE) seeks to integrate new knowledge while preserving past information. However, existing methods struggle with efficiency and scalability due to two key limitations: (1) suboptimal knowledge preservation between snapshots, caused by manually designed node/relation importance scores that ignore graph dependencies relevant to the downstream task, and (2) computationally expensive graph traversal for node/relation importance calculation, leading to slow training and high memory overhead. To address these limitations, we introduce ETT-CKGE (Efficient Task-driven Tokens for Continual Knowledge Graph Embedding), a novel task-guided CKGE method that leverages task-driven tokens for efficient and effective knowledge transfer between snapshots. Our method introduces a set of learnable tokens that directly capture task-relevant signals, eliminating the need for explicit node scoring or traversal. These tokens serve as consistent and reusable guidance across snapshots, enabling efficient token-masked embedding alignment between snapshots. Importantly, knowledge transfer is achieved through simple matrix operations, significantly reducing training time and memory usage. Extensive experiments across six benchmark datasets demonstrate that ETT-CKGE consistently achieves superior or competitive predictive performance while substantially improving training efficiency and scalability compared to state-of-the-art CKGE methods. The code is available at https://github.com/lijingzhu1/ETT-CKGE/tree/main
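To make "token-masked embedding alignment through simple matrix operations" concrete, here is a minimal sketch under stated assumptions. Each entity's weight is taken to be the sigmoid of its best dot-product match against a set of task tokens (one row of E·Tᵀ per entity), and the alignment loss is the token-weighted squared drift between the old and new embedding of the same entity. The weighting rule, the function names, and the toy numbers are all hypothetical; the released repository linked above is the authoritative reference for the actual method.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def token_mask(embeddings, tokens):
    """Hypothetical soft importance weight per entity from task tokens.

    Computes each row of the score matrix E @ T^T, keeps the entity's best
    token match, and squashes it with a sigmoid. No graph traversal needed.
    """
    return [1.0 / (1.0 + math.exp(-max(dot(e, t) for t in tokens))) for e in embeddings]


def masked_alignment_loss(old_emb, new_emb, tokens):
    """Token-weighted mean squared drift between two snapshots.

    old_emb[i] and new_emb[i] are assumed to describe the same entity.
    """
    weights = token_mask(old_emb, tokens)
    total = 0.0
    for w, old, new in zip(weights, old_emb, new_emb):
        total += w * sum((a - b) ** 2 for a, b in zip(old, new))
    return total / len(old_emb)


if __name__ == "__main__":
    tokens = [[1.0, 0.0]]            # one hypothetical task token
    old = [[2.0, 0.0], [-2.0, 0.0]]  # previous-snapshot embeddings
    new = [[3.0, 0.0], [-1.0, 0.0]]  # both entities drift by 1.0
    print(token_mask(old, tokens))   # first entity gets the larger weight
    print(masked_alignment_loss(old, new, tokens))
```

Because both entities drift by the same amount, the loss is dominated by the entity the token marks as task-relevant, which is the intended effect of masking.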

CV · Mar 31, 2025
CIBR: Cross-modal Information Bottleneck Regularization for Robust CLIP Generalization

Yingrui Ji, Xi Xiao, Gaofei Chen et al.

Contrastive Language-Image Pretraining (CLIP) has achieved remarkable success in cross-modal tasks such as zero-shot image classification and text-image retrieval by effectively aligning visual and textual representations. However, the theoretical foundations underlying CLIP's strong generalization remain unclear. In this work, we address this gap by proposing the Cross-modal Information Bottleneck (CIB) framework. CIB offers a principled interpretation of CLIP's contrastive learning objective as an implicit Information Bottleneck (IB) optimization. Under this view, the model maximizes shared cross-modal information while discarding modality-specific redundancies, thereby preserving essential semantic alignment across modalities. Building on this insight, we introduce a Cross-modal Information Bottleneck Regularization (CIBR) method that explicitly enforces these IB principles during training. CIBR introduces a penalty term to discourage modality-specific redundancy, thereby enhancing semantic alignment between image and text features. We validate CIBR on a broad set of vision-language benchmarks, including zero-shot classification on seven diverse image datasets and text-image retrieval on MSCOCO and Flickr30K. The results show consistent performance gains over standard CLIP. These findings provide the first theoretical understanding of CLIP's generalization through the IB lens. They also demonstrate practical improvements, offering guidance for future cross-modal representation learning.
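The objective CIBR builds on is CLIP's symmetric InfoNCE loss: image i should score highest against caption i, and vice versa. The sketch below implements that loss in plain Python and adds a regularizer weighted by `beta`. The abstract does not define CIBR's penalty term, so `redundancy_penalty` here is a hypothetical stand-in (the squared gap between matched unit embeddings), and the temperature of 0.07 and `beta=0.1` are assumed values.

```python
import math


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _row_cross_entropy(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    loss = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        loss += m + math.log(sum(math.exp(x - m) for x in row)) - row[i]
    return loss / len(logits)


def clip_infonce(img, txt, temperature=0.07):
    """Symmetric InfoNCE over a batch; img[i] and txt[i] are a matched pair."""
    img = [normalize(v) for v in img]
    txt = [normalize(v) for v in txt]
    logits = [[sum(a * b for a, b in zip(u, w)) / temperature for w in txt] for u in img]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (_row_cross_entropy(logits) + _row_cross_entropy(transposed))


def redundancy_penalty(img, txt):
    """HYPOTHETICAL stand-in: mean squared gap between matched unit embeddings.

    For unit vectors this equals 2 - 2 * cosine similarity. It is not the
    penalty defined in the CIBR paper.
    """
    gaps = []
    for u, w in zip(img, txt):
        u, w = normalize(u), normalize(w)
        gaps.append(sum((a - b) ** 2 for a, b in zip(u, w)))
    return sum(gaps) / len(gaps)


def cibr_style_loss(img, txt, beta=0.1, temperature=0.07):
    """Contrastive term plus a beta-weighted regularizer (beta is assumed)."""
    return clip_infonce(img, txt, temperature) + beta * redundancy_penalty(img, txt)


if __name__ == "__main__":
    img = [[1.0, 0.0], [0.0, 1.0]]
    aligned = [[1.0, 0.0], [0.0, 1.0]]
    swapped = [[0.0, 1.0], [1.0, 0.0]]
    print(cibr_style_loss(img, aligned))  # near zero: pairs already aligned
    print(cibr_style_loss(img, swapped))  # large: captions attached to wrong images
```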

LG · Sep 29, 2025
Leveraging Vulnerabilities in Temporal Graph Neural Networks via Strategic High-Impact Assaults

Dong Hyun Jeon, Lijing Zhu, Haifang Li et al.

Temporal Graph Neural Networks (TGNNs) have become indispensable for analyzing dynamic graphs in critical applications such as social networks, communication systems, and financial networks. However, the robustness of TGNNs against adversarial attacks, particularly sophisticated attacks that exploit the temporal dimension, remains a significant challenge. Existing attack methods for Spatio-Temporal Dynamic Graphs (STDGs) often rely on simplistic, easily detectable perturbations (e.g., random edge additions or deletions) and fail to strategically target the most influential nodes and edges for maximum impact. We introduce the High Impact Attack (HIA), a novel restricted black-box attack framework designed to overcome these limitations and expose critical vulnerabilities in TGNNs. HIA uses a data-driven surrogate model to identify structurally important nodes (central to network connectivity) and dynamically important nodes (critical to the graph's temporal evolution). It then applies a hybrid perturbation strategy that combines strategic edge injection (to create misleading connections) with targeted edge deletion (to disrupt essential pathways) to maximize TGNN performance degradation. Importantly, HIA minimizes the number of perturbations to enhance stealth, making the attack harder to detect. Comprehensive experiments on five real-world datasets and four representative TGNN architectures (TGN, JODIE, DySAT, and TGAT) demonstrate that HIA significantly reduces TGNN accuracy on the link prediction task, achieving up to a 35.55% decrease in Mean Reciprocal Rank (MRR), a substantial improvement over state-of-the-art baselines. These results highlight fundamental vulnerabilities in current STDG models and underscore the urgent need for robust defenses that account for both structural and temporal dynamics.
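The two-stage structure described above (score nodes, then spend a small edit budget on a mix of deletions and injections) can be sketched as follows. This is a heuristic illustration only: HIA scores nodes with a data-driven surrogate model, whereas this sketch swaps in static degree for structural importance and the change in activity between the early and late halves of the timeline for dynamic importance. The median time split, the even deletion/injection split of the budget, and all function names are assumptions.

```python
from collections import Counter


def node_importance(edges, alpha=1.0):
    """Heuristic score = degree (structural) + alpha * |late - early activity| (dynamic).

    edges is a list of (u, v, t). Splitting time at the median timestamp is an
    illustrative assumption; HIA itself uses a learned surrogate model.
    """
    times = sorted(t for _, _, t in edges)
    split = times[len(times) // 2]
    degree, early, late = Counter(), Counter(), Counter()
    for u, v, t in edges:
        for node in (u, v):
            degree[node] += 1
            if t < split:
                early[node] += 1
            else:
                late[node] += 1
    return {n: degree[n] + alpha * abs(late[n] - early[n]) for n in degree}


def hybrid_perturb(edges, budget, k=2):
    """Use at most `budget` edits around the top-k nodes: half deletions, half injections."""
    scores = node_importance(edges)
    targets = sorted(scores, key=lambda n: (-scores[n], n))[:k]
    n_delete = budget // 2
    n_inject = budget - n_delete

    # Targeted deletion: remove the most recent edges incident to a target node.
    touching = sorted(
        (i for i, (u, v, _) in enumerate(edges) if u in targets or v in targets),
        key=lambda i: -edges[i][2],
    )
    drop = set(touching[:n_delete])
    deleted = [edges[i] for i in sorted(drop)]
    kept = [e for i, e in enumerate(edges) if i not in drop]

    # Strategic injection: link targets to the least important nodes they never touched.
    existing = {frozenset((u, v)) for u, v, _ in edges}
    t_next = max(t for _, _, t in edges) + 1
    candidates = sorted(scores, key=lambda n: (scores[n], n))
    injected = []
    for tgt in targets:
        for cand in candidates:
            if len(injected) >= n_inject:
                break
            pair = frozenset((tgt, cand))
            if cand != tgt and pair not in existing:
                injected.append((tgt, cand, t_next))
                existing.add(pair)
    return kept + injected, deleted, injected


if __name__ == "__main__":
    edges = [("a", "b", 1), ("a", "c", 2), ("a", "d", 3), ("b", "c", 4), ("e", "f", 5)]
    perturbed, deleted, injected = hybrid_perturb(edges, budget=2)
    print("deleted:", deleted)    # prints deleted: [('b', 'c', 4)]
    print("injected:", injected)  # prints injected: [('a', 'e', 6)]
```

Keeping the budget small mirrors the stealth goal in the abstract: the graph changes by only `budget` edges, concentrated on the nodes the scorer ranks highest.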

LG · Feb 15, 2025
E2CB2former: Effective and Explainable Transformer for CB2 Receptor Ligand Activity Prediction

Jiacheng Xie, Yingrui Ji, Linghuan Zeng et al.

Accurate prediction of CB2 receptor ligand activity is pivotal for advancing drug discovery targeting this receptor, which is implicated in inflammation, pain management, and neurodegenerative conditions. Although conventional machine learning and deep learning techniques have shown promise, their limited interpretability remains a significant barrier to rational drug design. In this work, we introduce CB2former, a framework that combines a Graph Convolutional Network (GCN) with a Transformer architecture to predict CB2 receptor ligand activity. By leveraging the Transformer's self-attention mechanism alongside the GCN's structural learning capability, CB2former not only enhances predictive performance but also offers insights into the molecular features underlying receptor activity. We benchmark CB2former against diverse baseline models, including Random Forest, Support Vector Machine, K-Nearest Neighbors, Gradient Boosting, Extreme Gradient Boosting, Multilayer Perceptron, Convolutional Neural Network, and Recurrent Neural Network, and demonstrate its superior performance with an R² of 0.685, an RMSE of 0.675, and an AUC of 0.940. Moreover, attention weight analysis reveals key molecular substructures influencing CB2 receptor activity, underscoring the model's potential as an interpretable AI tool for drug discovery. This ability to pinpoint critical molecular motifs can streamline virtual screening, guide lead optimization, and expedite therapeutic development. Overall, our results showcase the transformative potential of advanced AI approaches, exemplified by CB2former, to deliver both accurate predictions and actionable molecular insights, thus fostering interdisciplinary collaboration and innovation in drug discovery.
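A GCN-then-attention pipeline of the kind described above can be sketched in plain Python: one graph convolution with the standard symmetric normalization D^(-1/2)(A + I)D^(-1/2), followed by single-head scaled dot-product self-attention over atoms, mean pooling, and a linear readout. The attention matrix is what makes per-atom importance readable. CB2former's layer sizes, number of heads, projections, and readout are not given in the abstract; using Q = K = V without learned projections and averaging the attention each atom receives are simplifications assumed for this sketch.

```python
import math


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalized_adjacency(adj):
    """Symmetric GCN propagation matrix D^-1/2 (A + I) D^-1/2."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def gcn_layer(adj, features, weight):
    """One graph convolution: ReLU(A_norm @ H @ W)."""
    out = matmul(matmul(normalized_adjacency(adj), features), weight)
    return [[max(0.0, x) for x in row] for row in out]


def self_attention(h):
    """Single-head scaled dot-product attention with Q = K = V = h (no projections)."""
    d = len(h[0])
    scores = [[sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in h] for q in h]
    weights = []
    for row in scores:
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        z = sum(exps)
        weights.append([e / z for e in exps])
    return matmul(weights, h), weights


def predict_activity(adj, features, gcn_weight, readout):
    """GCN -> attention -> mean pooling -> linear readout.

    Also returns each atom's average received attention as a crude importance score.
    """
    h = gcn_layer(adj, features, gcn_weight)
    z, attn = self_attention(h)
    pooled = [sum(col) / len(z) for col in zip(*z)]
    score = sum(p * w for p, w in zip(pooled, readout))
    n = len(attn)
    atom_importance = [sum(attn[i][j] for i in range(n)) / n for j in range(n)]
    return score, atom_importance


if __name__ == "__main__":
    adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]           # toy 3-atom chain
    feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]          # hypothetical GCN weights
    score, importance = predict_activity(adj, feats, w, readout=[0.5, -0.5])
    print(score, importance)
```

Since every attention row sums to 1, the per-atom importance values also sum to 1, which makes them easy to compare across atoms of the same molecule.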

LG · Jun 25, 2024
HGTDP-DTA: Hybrid Graph-Transformer with Dynamic Prompt for Drug-Target Binding Affinity Prediction

Xi Xiao, Wentao Wang, Jiacheng Xie et al.

Drug-target binding affinity (DTA) is a key criterion for drug screening. Existing experimental methods are time-consuming and rely on limited structural and domain information. While learning-based methods can model sequence and structural information, they struggle to integrate contextual data and often lack comprehensive modeling of drug-target interactions. In this study, we propose a novel DTA prediction method, termed HGTDP-DTA, which uses dynamic prompts within a hybrid Graph-Transformer framework. Our method generates context-specific prompts for each drug-target pair, enhancing the model's ability to capture unique interactions. Prompt tuning further optimizes the prediction process by dynamically adjusting the input features of the molecular graph, filtering out irrelevant noise and emphasizing task-relevant information. The proposed hybrid Graph-Transformer architecture combines structural information from Graph Convolutional Networks (GCNs) with sequence information captured by Transformers, facilitating the interaction between global and local information. Additionally, we adopt a multi-view feature fusion method that projects molecular graph views and affinity subgraph views into a common feature space, effectively combining structural and contextual information. Experiments on two widely used public datasets, Davis and KIBA, show that HGTDP-DTA outperforms state-of-the-art DTA prediction methods in both prediction performance and generalization ability.
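One way to read "context-specific prompts that dynamically adjust the input features of the molecular graph" is sketched below: a prompt vector is computed from the concatenation of a pooled drug summary and the target's features, then used as a sigmoid gate on every atom's features (values near 0 suppress a channel, which is one possible form of noise filtering). A second helper projects two views into a shared space and averages them, as a minimal form of multi-view fusion. The gating form, the averaging fusion, the weight shapes, and all function names are assumptions for illustration; the abstract does not specify HGTDP-DTA's exact prompt or fusion equations.

```python
import math


def linear(x, w):
    """Vector-matrix product: x has length d_in, w is a d_in x d_out list of rows."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def mean_pool(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def dynamic_prompt(drug_nodes, target_vec, w_prompt):
    """Pair-specific prompt from the concatenated [drug summary ; target features]."""
    context = mean_pool(drug_nodes) + list(target_vec)  # list concatenation
    return linear(context, w_prompt)


def apply_prompt(drug_nodes, prompt):
    """Gate every atom's features by sigmoid(prompt); values near 0 suppress a channel."""
    gate = [1.0 / (1.0 + math.exp(-p)) for p in prompt]
    return [[x * g for x, g in zip(node, gate)] for node in drug_nodes]


def fuse_views(graph_view, subgraph_view, w_graph, w_sub):
    """Project two views into a shared space and average them."""
    a = linear(graph_view, w_graph)
    b = linear(subgraph_view, w_sub)
    return [(x + y) / 2 for x, y in zip(a, b)]


if __name__ == "__main__":
    drug = [[1.0, 2.0], [3.0, 4.0]]                 # two atoms, two features each
    w_prompt = [[0.0, 0.0], [0.0, 0.0], [1.0, -1.0]]  # hypothetical prompt weights
    for target in ([1.0], [2.0]):                   # same drug, two different targets
        prompt = dynamic_prompt(drug, target, w_prompt)
        print(target, apply_prompt(drug, prompt))   # the gated features differ per target
```

The point of the demo is that the same drug graph receives different input features depending on which target it is paired with, which is what makes the prompt "dynamic".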