AI · Jun 2 · Code
scTranslation: A Comprehensive Benchmark for Single-Cell Multi-Omics Modality Translation
Jiabei Cheng, Jingbo Zhou, Jun Xia et al.
Simultaneous measurement of multiple omics modalities in single cells enables researchers to gain a more comprehensive understanding of cellular states and regulatory mechanisms. However, such measurements suffer from high experimental costs, significant noise, and incomplete modality coverage, so a variety of computational methods for modality translation have emerged in recent years. Despite the development of translation models, there is still a lack of systematic benchmark evaluation in terms of datasets, evaluation metrics, and influencing factors. To address this, we present scTranslation, a comprehensive benchmark for single-cell multi-omics modality translation tasks. It includes diverse translation datasets, integrates state-of-the-art models, and provides comprehensive evaluation metrics. In addition, we assess model performance under different scenarios, such as feature selection, feature quality, and few-shot settings. These factors significantly affect model performance but have rarely been systematically studied before. Leveraging this benchmark, we conduct a large-scale study of current methods and report findings that open up new possibilities for future development. The benchmark is open-sourced to facilitate future research. The code is anonymously released at https://github.com/Bunnybeibei/scTranslation.
LG · Apr 6, 2022 · Code
Structure-aware Protein Self-supervised Learning
Can Chen, Jingbo Zhou, Fan Wang et al.
Protein representation learning methods have shown great potential to yield useful representations for many downstream tasks, especially protein classification. Moreover, a few recent studies have shown great promise in addressing insufficient labels of proteins with self-supervised learning methods. However, existing protein language models are usually pretrained on protein sequences without considering the important protein structural information. To this end, we propose a novel structure-aware protein self-supervised learning method to effectively capture structural information of proteins. In particular, a well-designed graph neural network (GNN) model is pretrained to preserve the protein structural information with self-supervised tasks from a pairwise residue distance perspective and a dihedral angle perspective, respectively. Furthermore, we propose to leverage the available protein language model pretrained on protein sequences to enhance the self-supervised learning. Specifically, we identify the relation between the sequential information in the protein language model and the structural information in the specially designed GNN model via a novel pseudo bi-level optimization scheme. Experiments on several supervised downstream tasks verify the effectiveness of our proposed method. The code of the proposed method is available at https://github.com/GGchen1997/STEPS_Bioinformatics.
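The two geometric quantities named in the abstract above, pairwise residue distance and dihedral angle, can be computed from 3D coordinates as in the following minimal sketch. It is an illustration only, not the paper's pretraining code: the function names are hypothetical, the inputs are plain coordinate tuples, and the sign of the dihedral angle depends on a convention that differs between tools.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def pairwise_distance(p, q):
    """Euclidean distance between two 3D points (e.g. two residue positions)."""
    d = _sub(p, q)
    return math.sqrt(_dot(d, d))

def dihedral_degrees(p0, p1, p2, p3):
    """Dihedral angle in degrees defined by four consecutive points.

    The angle is measured between the plane through (p0, p1, p2) and the
    plane through (p1, p2, p3); the sign convention is an assumption.
    """
    b0, b1, b2 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b0, b1), _cross(b1, b2)   # normals of the two planes
    b1_len = math.sqrt(_dot(b1, b1))
    b1_unit = tuple(x / b1_len for x in b1)
    m1 = _cross(n1, b1_unit)                  # completes an orthogonal frame with n1 and b1
    return math.degrees(math.atan2(_dot(m1, n2), _dot(n1, n2)))
```

In a self-supervised setup such values would typically serve as prediction targets derived from the structure itself; how the paper discretizes or weights them is not stated in the abstract.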
LG · Nov 26, 2022
A Contextual Master-Slave Framework on Urban Region Graph for Urban Village Detection
Congxi Xiao, Jingbo Zhou, Jizhou Huang et al. · baidu
Urban villages (UVs) refer to the underdeveloped informal settlements falling behind the rapid urbanization in a city. Since there are high levels of social inequality and social risks in these UVs, it is critical for city managers to discover all UVs for making appropriate renovation policies. Existing approaches to detecting UVs are labor-intensive or have not fully addressed the unique challenges in UV detection, such as the scarcity of labeled UVs and the diverse urban patterns in different regions. To this end, we first build an urban region graph (URG) to model the urban area in a hierarchically structured way. Then, we design a novel contextual master-slave framework to effectively detect urban villages from the URG. The core idea of such a framework is to first pre-train a basis (or master) model over the URG, and then to adaptively derive specific (or slave) models from the basis model for different regions. The proposed framework can learn to balance generality and specificity for UV detection in an urban area. Finally, we conduct extensive experiments in three cities to demonstrate the effectiveness of our approach.
CR · Oct 9, 2023
LLM for SoC Security: A Paradigm Shift
Dipayan Saha, Shams Tarek, Katayoon Yahyaei et al.
As the ubiquity and complexity of system-on-chip (SoC) designs increase across electronic devices, the task of incorporating security into an SoC design flow poses significant challenges. Existing security solutions are inadequate to provide effective verification of modern SoC designs due to their limitations in scalability, comprehensiveness, and adaptability. On the other hand, Large Language Models (LLMs) are celebrated for their remarkable success in natural language understanding, advanced reasoning, and program synthesis tasks. Recognizing an opportunity, our research delves into leveraging the emergent capabilities of Generative Pre-trained Transformers (GPTs) to address the existing gaps in SoC security, aiming for a more efficient, scalable, and adaptable methodology. By integrating LLMs into the SoC security verification paradigm, we open a new frontier of possibilities and challenges to ensure the security of increasingly complex SoCs. This paper offers an in-depth analysis of existing works, showcases practical case studies, demonstrates comprehensive experiments, and provides useful prompting guidelines. We also present the achievements, prospects, and challenges of employing LLMs in different SoC security verification tasks.
LG · Aug 8, 2022
SDWPF: A Dataset for Spatial Dynamic Wind Power Forecasting Challenge at KDD Cup 2022
Jingbo Zhou, Xinjiang Lu, Yixiong Xiao et al.
The variability of wind power supply can present substantial challenges to incorporating wind power into a grid system. Thus, Wind Power Forecasting (WPF) has been widely recognized as one of the most critical issues in wind power integration and operation. There has been an explosion of studies on wind power forecasting problems in the past decades. Nevertheless, handling the WPF problem well is still challenging, since high prediction accuracy is always demanded to ensure grid stability and security of supply. We present a unique Spatial Dynamic Wind Power Forecasting dataset, SDWPF, which includes the spatial distribution of wind turbines as well as the dynamic context factors. Most existing datasets cover only a small number of wind turbines and lack the locations and context information of the turbines at a fine-grained time scale. By contrast, SDWPF provides the wind power data of 134 wind turbines from a wind farm over half a year with their relative positions and internal statuses. We use this dataset to launch the Baidu KDD Cup 2022 to examine the limit of current WPF solutions. The dataset is released at https://aistudio.baidu.com/aistudio/competition/detail/152/0/datasets.
LG · Jun 21, 2023
Spatial Heterophily Aware Graph Neural Networks
Congxi Xiao, Jingbo Zhou, Jizhou Huang et al. · baidu
Graph Neural Networks (GNNs) have been broadly applied in many urban applications upon formulating a city as an urban graph whose nodes are urban objects like regions or points of interest. Recently, a few enhanced GNN architectures have been developed to tackle heterophilic graphs where connected nodes are dissimilar. However, urban graphs usually can be observed to possess a unique spatial heterophily property; that is, the dissimilarity of neighbors at different spatial distances can exhibit great diversity. Although this property often exists, it has not been explored. To this end, in this paper, we propose a metric, named Spatial Diversity Score, to quantitatively measure the spatial heterophily and show how it can influence the performance of GNNs. Indeed, our experimental investigation clearly shows that existing heterophilic GNNs are still deficient in handling urban graphs with a high spatial diversity score. This, in turn, may degrade their effectiveness in urban applications. Along this line, we propose a Spatial Heterophily Aware Graph Neural Network (SHGNN) to tackle the spatial diversity of heterophily of urban graphs. Based on the key observation that spatially close neighbors on the urban graph present a more similar mode of difference to the central node, we first design a rotation-scaling spatial aggregation module, whose core idea is to properly group the spatially close neighbors and separately process each group with less diversity inside. Then, a heterophily-sensitive spatial interaction module is designed to adaptively capture the commonality and diverse dissimilarity in different spatial groups. Extensive experiments on three real-world urban datasets demonstrate the superiority of our SHGNN over several competitors.
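The idea of grouping spatially close neighbors before aggregation can be sketched as below. This is a simplified illustration under stated assumptions, not SHGNN's rotation-scaling module, whose details the abstract does not give: neighbors are bucketed by direction sector and distance ring around the central node, and the names `sector_index`, `ring_index`, `group_neighbors`, the number of sectors, and the ring edges are all hypothetical.

```python
import math
from collections import defaultdict

def sector_index(center, neighbor, num_sectors=4):
    """Index of the direction sector (counter-clockwise from the +x axis)."""
    angle = math.atan2(neighbor[1] - center[1], neighbor[0] - center[0])
    angle %= 2 * math.pi
    width = 2 * math.pi / num_sectors
    # The final modulo guards against a floating-point angle equal to 2*pi.
    return int(angle // width) % num_sectors

def ring_index(center, neighbor, ring_edges=(1.0, 3.0)):
    """Index of the distance ring; the last ring is unbounded."""
    dist = math.hypot(neighbor[0] - center[0], neighbor[1] - center[1])
    for i, edge in enumerate(ring_edges):
        if dist <= edge:
            return i
    return len(ring_edges)

def group_neighbors(center, neighbors, num_sectors=4, ring_edges=(1.0, 3.0)):
    """Map (sector, ring) -> list of neighbor ids falling into that cell."""
    groups = defaultdict(list)
    for node_id, xy in neighbors.items():
        key = (sector_index(center, xy, num_sectors),
               ring_index(center, xy, ring_edges))
        groups[key].append(node_id)
    return dict(groups)

example = group_neighbors(
    (0.0, 0.0),
    {"a": (0.5, 0.5), "b": (-2.0, 0.1), "c": (0.6, 0.4), "d": (1.0, -5.0)},
)
```

Each group could then be aggregated separately (for instance with its own mean and weights), so that features from dissimilar directions or distances are not averaged together.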
DC · Nov 24, 2022
Multi-Job Intelligent Scheduling with Cross-Device Federated Learning
Ji Liu, Juncheng Jia, Beichen Ma et al.
Recent years have witnessed a large amount of decentralized data in various (edge) devices of end-users, while the decentralized data aggregation remains complicated for machine learning jobs because of regulations and laws. As a practical approach to handling decentralized data, Federated Learning (FL) enables collaborative global machine learning model training without sharing sensitive raw data. The servers schedule devices to jobs within the training process of FL. However, device scheduling with multiple jobs in FL remains a critical and open problem. In this paper, we propose a novel multi-job FL framework, which enables the training process of multiple jobs in parallel. The multi-job FL framework is composed of a system model and a scheduling method. The system model enables a parallel training process of multiple jobs, with a cost model based on the data fairness and the training time of diverse devices during the parallel training process. We propose a novel intelligent scheduling approach based on multiple scheduling methods, including an original reinforcement learning-based scheduling method and an original Bayesian optimization-based scheduling method, which corresponds to a small cost while scheduling devices to multiple jobs. We conduct extensive experimentation with diverse jobs and datasets. The experimental results reveal that our proposed approaches significantly outperform baseline approaches in terms of training time (up to 12.73 times faster) and accuracy (up to 46.4% higher).
AI · Dec 1, 2022
Human-instructed Deep Hierarchical Generative Learning for Automated Urban Planning
Dongjie Wang, Lingfei Wu, Denghui Zhang et al.
The essential task of urban planning is to generate the optimal land-use configuration of a target area. However, traditional urban planning is time-consuming and labor-intensive. Deep generative learning gives us hope that we can automate this planning process and come up with the ideal urban plans. While remarkable achievements have been obtained, they have exhibited limitations in lacking awareness of: 1) the hierarchical dependencies between functional zones and spatial grids; 2) the peer dependencies among functional zones; and 3) human regulations to ensure the usability of generated configurations. To address these limitations, we develop a novel human-instructed deep hierarchical generative model. We rethink the urban planning generative task from a unique functionality perspective, where we summarize planning requirements into different functionality projections for better urban plan generation. To this end, we develop a three-stage generation process from a target area to zones to grids. The first stage is to label the grids of a target area with latent functionalities to discover functional zones. The second stage is to perceive the planning requirements to form urban functionality projections. We propose a novel module: functionalizer to project the embedding of human instructions and geospatial contexts to the zone-level plan to obtain such projections. Each projection includes the information of land-use portfolios and the structural dependencies across spatial grids in terms of a specific urban function. The third stage is to leverage multi-attentions to model the zone-zone peer dependencies of the functionality projections to generate grid-level land-use configurations. Finally, we present extensive experiments to demonstrate the effectiveness of our framework.
CL · Jan 5, 2023
Towards Table-to-Text Generation with Pretrained Language Model: A Table Structure Understanding and Text Deliberating Approach
Miao Chen, Xinjiang Lu, Tong Xu et al.
Although remarkable progress on the neural table-to-text methods has been made, the generalization issues hinder the applicability of these models due to the limited source tables. Large-scale pretrained language models sound like a promising solution to tackle such issues. However, how to effectively bridge the gap between the structured table and the text input by fully leveraging table information to fuel the pretrained model is still not well explored. Besides, another challenge of integrating the deliberation mechanism into the text-to-text pretrained model for solving the table-to-text task remains seldom studied. In this paper, to implement the table-to-text generation with pretrained language model, we propose a table structure understanding and text deliberating approach, namely TASD. Specifically, we devise a three-layered multi-head attention network to realize the table-structure-aware text generation model with the help of the pretrained language model. Furthermore, a multi-pass decoder framework is adopted to enhance the capability of polishing generated text for table descriptions. The empirical studies, as well as human evaluation, on two public datasets, validate that our approach can generate faithful and fluent descriptive texts for different types of tables.
QM · Apr 7 · Code
MAT-Cell: A Multi-Agent Tree-Structured Reasoning Framework for Batch-Level Single-Cell Annotation
Yehui Yang, Zelin Zang, Changxi Chi et al.
Automated cellular reasoning faces a core dichotomy: supervised methods fall into the Reference Trap and fail to generalize to out-of-distribution cell states, while large language models (LLMs), without grounded biological priors, suffer from a Signal-to-Noise Paradox that produces spurious associations. We propose MAT-Cell, a neuro-symbolic reasoning framework that reframes single-cell analysis from black-box classification into constructive, verifiable proof generation. MAT-Cell injects symbolic constraints through adaptive Retrieval-Augmented Generation (RAG) to ground neural reasoning in biological axioms and reduce transcriptomic noise. It further employs a dialectic verification process with homogeneous rebuttal agents to audit and prune reasoning paths, forming syllogistic derivation trees that enforce logical consistency. Across large-scale and cross-species benchmarks, MAT-Cell significantly outperforms state-of-the-art (SOTA) models and maintains robust performance in challenging scenarios where baseline methods severely degrade. Code is available at https://github.com/jiangliu91/MAT-Cell-A-Multi-Agent-Tree-Structured-Reasoning-Framework-for-Batch-Level-Single-Cell-Annotation.
LG · Nov 12, 2022
Robust Training of Graph Neural Networks via Noise Governance
Siyi Qian, Haochao Ying, Renjun Hu et al.
Graph Neural Networks (GNNs) have become widely used models for semi-supervised learning. However, the robustness of GNNs in the presence of label noise remains a largely under-explored problem. In this paper, we consider an important yet challenging scenario where labels on nodes of graphs are not only noisy but also scarce. In this scenario, the performance of GNNs is prone to degrade due to label noise propagation and insufficient learning. To address these issues, we propose a novel RTGNN (Robust Training of Graph Neural Networks via Noise Governance) framework that achieves better robustness by learning to explicitly govern label noise. More specifically, we introduce self-reinforcement and consistency regularization as supplemental supervision. The self-reinforcement supervision is inspired by the memorization effects of deep neural networks and aims to correct noisy labels. Further, the consistency regularization prevents GNNs from overfitting to noisy labels via mimicry loss in both the inter-view and intra-view perspectives. To leverage such supervision, we divide labels into clean and noisy types, rectify inaccurate labels, and further generate pseudo-labels on unlabeled nodes. Supervision for nodes with different types of labels is then chosen adaptively. This enables sufficient learning from clean labels while limiting the impact of noisy ones. We conduct extensive experiments to evaluate the effectiveness of our RTGNN framework, and the results validate its consistently superior performance over state-of-the-art methods with two types of label noise and various noise rates.
CV · Jul 21, 2023
Tuning Pre-trained Model via Moment Probing
Mingze Gao, Qilong Wang, Zhenyi Lin et al.
Recently, efficient fine-tuning of large-scale pre-trained models has attracted increasing research interest, where linear probing (LP) as a fundamental module is involved in exploiting the final representations for task-dependent classification. However, most of the existing methods focus on how to effectively introduce a few learnable parameters, and little work pays attention to the commonly used LP module. In this paper, we propose a novel Moment Probing (MP) method to further explore the potential of LP. Distinguished from LP, which builds a linear classification head based on the mean of final features (e.g., word tokens for ViT) or classification tokens, our MP performs a linear classifier on feature distribution, which provides stronger representation ability by exploiting richer statistical information inherent in features. Specifically, we represent feature distribution by its characteristic function, which is efficiently approximated by using first- and second-order moments of features. Furthermore, we propose a multi-head convolutional cross-covariance (MHC$^3$) to compute second-order moments in an efficient and effective manner. By considering that MP could affect feature learning, we introduce a partially shared module to learn two recalibrating parameters (PSRP) for backbones based on MP, namely MP$_{+}$. Extensive experiments on ten benchmarks using various models show that our MP significantly outperforms LP and is competitive with counterparts at lower training cost, while our MP$_{+}$ achieves state-of-the-art performance.
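The contrast between probing on the feature mean and probing on first- and second-order moments can be sketched as follows. This is a minimal illustration, not the paper's method: it uses a plain covariance matrix in place of the MHC$^3$ cross-covariance, takes token features as nested Python lists, and all function names are hypothetical.

```python
def first_moment(tokens):
    """Mean feature vector over N tokens, each of dimension D."""
    n, d = len(tokens), len(tokens[0])
    return [sum(t[j] for t in tokens) / n for j in range(d)]

def second_moment(tokens):
    """D x D covariance of the token features (population form, divide by N)."""
    mu = first_moment(tokens)
    n, d = len(tokens), len(mu)
    return [[sum((t[i] - mu[i]) * (t[j] - mu[j]) for t in tokens) / n
             for j in range(d)] for i in range(d)]

def moment_features(tokens):
    """Concatenate the mean with the upper triangle of the covariance."""
    mu = first_moment(tokens)
    cov = second_moment(tokens)
    d = len(mu)
    upper = [cov[i][j] for i in range(d) for j in range(i, d)]
    return mu + upper

def linear_logits(features, weights, bias):
    """Linear classification head: one weight row and one bias per class."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, bias)]

tokens = [[1.0, 2.0], [3.0, 6.0]]
features = moment_features(tokens)   # 2 mean entries + 3 covariance entries
```

A linear probe on the mean alone would see only the first two entries of `features`; adding the covariance entries gives the classifier access to how the token features vary together, at the price of a feature length that grows quadratically in D.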
LG · Jun 15, 2023
Multi-Temporal Relationship Inference in Urban Areas
Shuangli Li, Jingbo Zhou, Ji Liu et al.
Finding multiple temporal relationships among locations can benefit many urban applications, such as dynamic offline advertising and smart public transport planning. While some efforts have been made on finding static relationships among locations, little attention is focused on studying time-aware location relationships. Indeed, abundant location-based human activities are time-varying and the availability of these data enables a new paradigm for understanding the dynamic relationships in a period among connective locations. To this end, we propose to study a new problem, namely multi-Temporal relationship inference among locations (Trial for short), where the major challenge is how to integrate dynamic and geographical influence under the relationship sparsity constraint. Specifically, we propose a solution to Trial with a graph learning scheme, which includes a spatially evolving graph neural network (SEENet) with two collaborative components: a spatially evolving graph convolution module (SEConv) and a spatially evolving self-supervised learning strategy (SE-SSL). SEConv performs the intra-time aggregation and inter-time propagation to capture the multifaceted spatially evolving contexts from the view of location message passing. In addition, SE-SSL designs time-aware self-supervised learning tasks in a global-local manner with an additional evolving constraint to enhance the location representation learning and further handle the relationship sparsity. Finally, experiments on four real-world datasets demonstrate the superiority of our method over several state-of-the-art approaches.
LG · Apr 20 · Code
LoReC: Rethinking Large Language Models for Graph Data Analysis
Hongyu Zhan, Qixin Wang, Yusen Tan et al.
The advent of Large Language Models (LLMs) has fundamentally reshaped the way we interact with graphs, giving rise to a new paradigm called GraphLLM. As revealed in recent studies, graph learning can benefit from LLMs. However, we observe limited benefits when we directly utilize LLMs to make predictions for graph-related tasks within the GraphLLM paradigm, which even yields suboptimal results compared to conventional GNN-based approaches. Through in-depth analysis, we find this failure can be attributed to LLMs' limited capability for processing graph data and their tendency to overlook graph information. To address this issue, we propose LoReC (Look, Remember, and Contrast), a novel plug-and-play method for the GraphLLM paradigm, which enhances LLMs' understanding of graph data through three stages: (1) Look: redistributing attention to the graph; (2) Remember: re-injecting graph information into the Feed-Forward Network (FFN); (3) Contrast: rectifying the vanilla logits produced in the decoding process. Extensive experiments demonstrate that LoReC brings notable improvements over current GraphLLM methods and outperforms GNN-based approaches across diverse datasets. The implementation is available at https://github.com/Git-King-Zhan/LoReC.
LG · Aug 31, 2023
Irregular Traffic Time Series Forecasting Based on Asynchronous Spatio-Temporal Graph Convolutional Network
Weijia Zhang, Le Zhang, Jindong Han et al.
Accurate traffic forecasting is crucial for the development of Intelligent Transportation Systems (ITS), playing a pivotal role in modern urban traffic management. Traditional forecasting methods, however, struggle with the irregular traffic time series resulting from adaptive traffic signal controls, presenting challenges in asynchronous spatial dependency, irregular temporal dependency, and predicting variable-length sequences. To this end, we propose an Asynchronous Spatio-tEmporal graph convolutional nEtwoRk (ASeer) tailored for irregular traffic time series forecasting. Specifically, we first propose an Asynchronous Graph Diffusion Network to capture the spatial dependency between asynchronously measured traffic states regulated by adaptive traffic signals. After that, to capture the temporal dependency within irregular traffic state sequences, a personalized time encoding is devised to embed the continuous time signals. Then, we propose a Transformable Time-aware Convolution Network, which adapts meta-filters for time-aware convolution on the sequences with inconsistent temporal flow. Additionally, a Semi-Autoregressive Prediction Network, comprising a state evolution unit and a semi-autoregressive predictor, is designed to predict variable-length traffic sequences effectively and efficiently. Extensive experiments on a newly established benchmark demonstrate the superiority of ASeer compared with twelve competitive baselines across six metrics.
AI · Jan 28 · Code
OmegaUse: Building a General-Purpose GUI Agent for Autonomous Task Execution
Le Zhang, Yixiong Xiao, Xinjiang Lu et al.
Graphical User Interface (GUI) agents show great potential for enabling foundation models to complete real-world tasks, revolutionizing human-computer interaction and improving human productivity. In this report, we present OmegaUse, a general-purpose GUI agent model for autonomous task execution on both mobile and desktop platforms, supporting computer-use and phone-use scenarios. Building an effective GUI agent model relies on two factors: (1) high-quality data and (2) effective training methods. To address these, we introduce a carefully engineered data-construction pipeline and a decoupled training paradigm. For data construction, we leverage rigorously curated open-source datasets and introduce a novel automated synthesis framework that integrates bottom-up autonomous exploration with top-down taxonomy-guided generation to create high-fidelity synthetic data. For training, to better leverage these data, we adopt a two-stage strategy: Supervised Fine-Tuning (SFT) to establish fundamental interaction syntax, followed by Group Relative Policy Optimization (GRPO) to improve spatial grounding and sequential planning. To balance computational efficiency with agentic reasoning capacity, OmegaUse is built on a Mixture-of-Experts (MoE) backbone. To evaluate cross-terminal capabilities in an offline setting, we introduce OS-Nav, a benchmark suite spanning multiple operating systems: ChiM-Nav, targeting Chinese Android mobile environments, and Ubu-Nav, focusing on routine desktop interactions on Ubuntu. Extensive experiments show that OmegaUse is highly competitive across established GUI benchmarks, achieving a state-of-the-art (SOTA) score of 96.3% on ScreenSpot-V2 and a leading 79.1% step success rate on AndroidControl. OmegaUse also performs strongly on OS-Nav, reaching 74.24% step success on ChiM-Nav and 55.9% average success on Ubu-Nav.
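The group-relative normalization that gives Group Relative Policy Optimization (GRPO) its name can be sketched as below: several responses are sampled for the same prompt, and each response's reward is standardized against the group. This is a generic illustration, not OmegaUse's training code; the reward values, the use of the population standard deviation, and the `eps` stabilizer are assumptions, and the policy-gradient update itself is omitted.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against the other samples in its group.

    rewards: scalar rewards for responses sampled from the same prompt.
    eps: small constant (an assumption) that avoids division by zero
         when every response in the group receives the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical group of four sampled GUI actions, rewarded 1.0 when the
# predicted click lands on the target element and 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the baseline is the group mean, no separate value network is needed; responses that beat their siblings get positive advantages and the rest get negative ones.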
LG · Oct 21, 2023
Pre-Training on Large-Scale Generated Docking Conformations with HelixDock to Unlock the Potential of Protein-ligand Structure Prediction Models
Lihang Liu, Shanzhuo Zhang, Donglong He et al.
Protein-ligand structure prediction is an essential task in drug discovery, predicting the binding interactions between small molecules (ligands) and target proteins (receptors). Recent advances have incorporated deep learning techniques to improve the accuracy of protein-ligand structure prediction. Nevertheless, the experimental validation of docking conformations remains costly, which raises concerns regarding the generalizability of these deep learning-based methods due to the limited training data. In this work, we show that by pre-training on large-scale docking conformations generated by traditional physics-based docking tools and then fine-tuning with a limited set of experimentally validated receptor-ligand complexes, we can obtain a protein-ligand structure prediction model with outstanding performance. Specifically, this process involved the generation of 100 million docking conformations for protein-ligand pairings, an endeavor consuming roughly 1 million CPU core days. The proposed model, HelixDock, aims to acquire the physical knowledge encapsulated by the physics-based docking tools during the pre-training phase. HelixDock has been rigorously benchmarked against both physics-based and deep learning-based baselines, demonstrating its exceptional precision and robust transferability in predicting binding conformations. In addition, our investigation reveals the scaling laws governing pre-trained protein-ligand structure prediction models, indicating a consistent enhancement in performance with increases in model parameters and the volume of pre-training data. Moreover, we applied HelixDock to several drug discovery-related tasks to validate its practical utility. HelixDock demonstrates outstanding capabilities on both cross-docking and structure-based virtual screening benchmarks.
CL · Jul 29, 2024
Improving Retrieval Augmented Language Model with Self-Reasoning
Yuan Xia, Jingbo Zhou, Zhenhui Shi et al.
The Retrieval-Augmented Language Model (RALM) has shown remarkable performance on knowledge-intensive tasks by incorporating external knowledge during inference, which mitigates the factual hallucinations inherent in large language models (LLMs). Despite these advancements, challenges persist in the implementation of RALMs, particularly concerning their reliability and traceability. To be specific, irrelevant document retrieval may result in unhelpful response generation or even deteriorate the performance of LLMs, while the lack of proper citations in generated outputs complicates efforts to verify the trustworthiness of the models. To this end, we propose a novel self-reasoning framework aimed at improving the reliability and traceability of RALMs, whose core idea is to leverage reasoning trajectories generated by the LLM itself. The framework involves constructing self-reasoning trajectories with three processes: a relevance-aware process, an evidence-aware selective process, and a trajectory analysis process. We have evaluated our framework across four public datasets (two short-form QA datasets, one long-form QA dataset, and one fact verification dataset) to demonstrate the superiority of our method, which can outperform existing state-of-the-art models and can achieve comparable performance with GPT-4, while only using 2,000 training samples.
QM · Feb 23 · Code
Regressor-guided Diffusion Model for De Novo Peptide Sequencing with Explicit Mass Control
Shaorong Chen, Jingbo Zhou, Jun Xia
The discovery of novel proteins relies on sensitive protein identification, for which de novo peptide sequencing (DNPS) from mass spectra is a crucial approach. While deep learning has advanced DNPS, existing models inadequately enforce the fundamental mass consistency constraint that a predicted peptide's mass must match the experimentally measured precursor mass. Previous DNPS methods often treat this critical information as a simple input feature or use it in post-processing, leading to numerous implausible predictions that do not adhere to this fundamental physical property. To address this limitation, we introduce DiffuNovo, a novel regressor-guided diffusion model for de novo peptide sequencing that provides explicit peptide-level mass control. Our approach integrates the mass constraint at two critical stages: during training, a novel peptide-level mass loss guides model optimization, while at inference, regressor-based guidance from gradient-based updates in the latent space steers generation so that the predicted peptide adheres to the mass constraint. Comprehensive evaluations on established benchmarks demonstrate that DiffuNovo surpasses state-of-the-art methods in DNPS accuracy. Additionally, as the first DNPS model to employ a diffusion model as its core backbone, DiffuNovo leverages the powerful controllability of diffusion architecture and achieves a significant reduction in mass error, thereby producing much more physically plausible peptides. These innovations represent a substantial advancement toward robust and broadly applicable DNPS. The source code is available in the supplementary material.
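The mass consistency constraint itself can be made concrete with a short check: the sum of the residue masses plus one water must match the neutral mass implied by the precursor m/z and charge. The sketch below is a post-hoc filter of the kind the abstract attributes to earlier methods, not DiffuNovo's regressor guidance. It uses standard monoisotopic masses for unmodified residues; the function names and the 20 ppm tolerance are assumptions.

```python
# Monoisotopic residue masses in daltons (unmodified amino acids).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # one H2O is added for the peptide termini
PROTON = 1.007276   # mass of the charge carrier

def peptide_mass(sequence):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER

def precursor_neutral_mass(mz, charge):
    """Neutral mass implied by an observed precursor m/z and charge state."""
    return (mz - PROTON) * charge

def mass_error_ppm(sequence, mz, charge):
    """Signed difference between predicted and observed mass, in ppm."""
    observed = precursor_neutral_mass(mz, charge)
    return (peptide_mass(sequence) - observed) / observed * 1e6

def is_mass_consistent(sequence, mz, charge, tol_ppm=20.0):
    """True if the predicted peptide matches the precursor within tol_ppm."""
    return abs(mass_error_ppm(sequence, mz, charge)) <= tol_ppm

# Doubly charged precursor m/z that the peptide PEPTIDE would produce.
example_mz = (peptide_mass("PEPTIDE") + 2 * PROTON) / 2
```

Note that the check cannot separate isobaric candidates: leucine and isoleucine have identical masses, so swapping one for the other still passes.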
LG · Jul 12, 2024
Unifying Sequences, Structures, and Descriptions for Any-to-Any Protein Generation with the Large Multimodal Model HelixProtX
Zhiyuan Chen, Tianhao Chen, Chenggang Xie et al.
Proteins are fundamental components of biological systems and can be represented through various modalities, including sequences, structures, and textual descriptions. Despite the advances in deep learning and scientific large language models (LLMs) for protein research, current methodologies predominantly focus on limited specialized tasks -- often predicting one protein modality from another. These approaches restrict the understanding and generation of multimodal protein data. In contrast, large multimodal models have demonstrated potential capabilities in generating any-to-any content like text, images, and videos, thus enriching user interactions across various domains. Integrating these multimodal model technologies into protein research offers significant promise by potentially transforming how proteins are studied. To this end, we introduce HelixProtX, a system built upon the large multimodal model, aiming to offer a comprehensive solution to protein research by supporting any-to-any protein modality generation. Unlike existing methods, it allows for the transformation of any input protein modality into any desired protein modality. The experimental results affirm the advanced capabilities of HelixProtX, not only in generating functional descriptions from amino acid sequences but also in executing critical tasks such as designing protein sequences and structures from textual descriptions. Preliminary findings indicate that HelixProtX consistently achieves superior accuracy across a range of protein-related tasks, outperforming existing state-of-the-art models. By integrating multimodal large models into protein research, HelixProtX opens new avenues for understanding protein biology, thereby promising to accelerate scientific discovery.
LG May 18
scHelix: Asymmetric Dual-Stream Integration via Explicit Gene-Level Disentanglement
Xichen Yan, Zelin Zang, Changxi Chi et al.
A critical challenge in single-cell RNA sequencing (scRNA-seq) integration is resolving the tension between eliminating batch effects and maintaining biological fidelity. While recent evidence indicates that batch effects manifest heterogeneously across genes, most existing methods process the transcriptome uniformly, frequently resulting in over-correction and loss of subtle biological signals. To address this, we present scHelix, a dataset-adaptive framework that fundamentally changes how features are processed by explicitly partitioning genes into domain-invariant Anchors and domain-sensitive Variants at the input level. scHelix utilizes a dual-stream sparse diffusion encoder equipped with stop-gradient graph caching to efficiently learn multi-scale structural representations. The core of our approach is a novel asymmetric Align-Refine-Fuse protocol: the unstable Variant stream is first aligned to the robust topology of the Anchor stream, followed by a conservative refinement phase where the Anchor stream absorbs denoised details via bounded residual gating. This divide-and-conquer architecture prevents shortcut learning and ensures robust batch removal without compromising the integrity of biological clusters. Extensive benchmarking demonstrates that scHelix outperforms state-of-the-art methods.
AI May 18
DuIVRS-2: An LLM-based Interactive Voice Response System for Large-scale POI Attribute Acquisition
Le Zhang, Shengming Zhang, Rui Zha et al.
Accurate Point of Interest (POI) attribute acquisition is essential for location-based services, yet traditional modular Interactive Voice Response (IVR) systems suffer from error accumulation and high maintenance overhead. We present DuIVRS-2, a large language model (LLM)-based end-to-end framework designed for large-scale POI attribute acquisition at Baidu Maps. To address the long-tail distribution of real-world interactions, our methodology first employs a finite state machine (FSM)-guided data augmentation strategy to synthesize a balanced and diverse training dataset. We then streamline dialogue management via a selective generation scheme combined with a Chain-of-Thought (CoT) mechanism, which ensures output stability and effectively eliminates hallucinations in industrial settings. To facilitate continuous policy refinement with minimal manual effort, we design a cooperative iterative learning framework that leverages a dual-evaluator voting system. Deployed in production for two months, DuIVRS-2 processed 0.4 million calls daily and achieved an 83.9% Task Success Rate (TSR), outperforming its predecessor by 4 percentage points while maintaining a low reaction time of 130 ms. This work provides a production-proven reference for developing robust, cost-effective LLM agents for large-scale industrial dialogue applications.
LG Feb 23
VecFormer: Towards Efficient and Generalizable Graph Transformer with Graph Token Attention
Jingbo Zhou, Jun Xia, Siyuan Li et al.
Graph Transformer has demonstrated impressive capabilities in the field of graph representation learning. However, existing approaches face two critical challenges: (1) most models suffer from exponentially increasing computational complexity, making it difficult to scale to large graphs; (2) attention mechanisms based on node-level operations limit the flexibility of the model and result in poor generalization performance in out-of-distribution (OOD) scenarios. To address these issues, we propose VecFormer (the Vector Quantized Graph Transformer), an efficient and highly generalizable model for node classification, particularly under OOD settings. VecFormer adopts a two-stage training paradigm. In the first stage, two codebooks are used to reconstruct the node features and the graph structure, aiming to learn the rich semantic Graph Codes. In the second stage, attention mechanisms are performed at the Graph Token level based on the transformed cross codebook, reducing computational complexity while enhancing the model's generalization capability. Extensive experiments on datasets of various sizes demonstrate that VecFormer outperforms the existing Graph Transformer in both performance and speed.
LG Jan 30
Unrewarded Exploration in Large Language Models Reveals Latent Learning from Psychology
Jian Xiong, Jingbo Zhou, Zihan Zhou et al.
Latent learning, classically theorized by Tolman, shows that biological agents (e.g., rats) can acquire internal representations of their environment without rewards, enabling rapid adaptation once rewards are introduced. In contrast, from a cognitive science perspective, reward learning remains overly dependent on external feedback, limiting flexibility and generalization. Although recent advances in the reasoning capabilities of large language models (LLMs), such as OpenAI-o1 and DeepSeek-R1, mark a significant breakthrough, these models still rely primarily on reward-centric reinforcement learning paradigms. Whether and how the well-established phenomenon of latent learning in psychology can inform or emerge within LLM training remains largely unexplored. In this work, we present novel findings from our experiments showing that LLMs also exhibit latent learning dynamics. During an initial phase of unrewarded exploration, LLMs display modest performance improvements, as this phase allows LLMs to organize task-relevant knowledge without being constrained by reward-driven biases, and performance is further enhanced once rewards are introduced. LLMs post-trained under this two-stage exploration regime ultimately achieve higher competence than those post-trained with reward-based reinforcement learning throughout. Beyond these empirical observations, we also provide theoretical analyses for our experiments explaining why unrewarded exploration yields performance gains, offering a mechanistic account of these dynamics. Specifically, we conducted extensive experiments across multiple model families and diverse task domains to establish the existence of latent learning dynamics in LLMs.
LG Jan 16, 2023
Adaptive Depth Graph Attention Networks
Jingbo Zhou, Yixuan Du, Ruqiong Zhang et al.
As one of the most popular GNN architectures, the graph attention network (GAT) is considered the most advanced learning architecture for graph representation and has been widely used in various graph mining tasks with impressive results. However, since GAT was proposed, none of the existing studies have provided systematic insight into the relationship between the performance of GAT and the number of layers, which is a critical issue in guiding model performance improvement. In this paper, we perform a systematic experimental evaluation and, based on the experimental results, find two important facts: (1) the main factor limiting the accuracy of the GAT model as the number of layers increases is the oversquashing phenomenon; (2) among the previous improvements applied to GNN models, only the residual connection can significantly improve GAT model performance. We combine these two important findings to provide a theoretical explanation that it is the residual connection that mitigates the loss of original feature information due to oversquashing and thus improves deep GAT model performance. This provides empirical insights and guidelines for researchers to design GAT variants with appropriate depth and good performance. To demonstrate the effectiveness of our proposed guidelines, we propose a GAT variant, ADGAT, that adaptively selects the number of layers based on the sparsity of the graph, and experimentally demonstrate that its effectiveness is significantly improved over the original GAT.
AI Nov 30, 2025
ChartAnchor: Chart Grounding with Structural-Semantic Fidelity
Xinhang Li, Jingbo Zhou, Pengfei Luo et al.
Recent advances in multimodal large language models (MLLMs) highlight the need for benchmarks that rigorously evaluate structured chart comprehension. Chart grounding refers to the bidirectional alignment between a chart's visual appearance and the structured semantics. This task requires models to produce a symbolic specification that faithfully captures the chart's visual and structural intent, while also recovering the underlying tabular data with precise values and relationships. Chart grounding directly reflects a model's capabilities in numerical reasoning, multimodal alignment, and structural reconstruction, and has several important applications in real-world scenarios. Existing benchmarks, constrained by narrow chart diversity, isolated tasks, and incomplete evaluation frameworks, fail to holistically assess grounding. To address this, we propose ChartAnchor, a comprehensive benchmark of 8k+ chart-table-code triples spanning 30 chart types drawn from diverse real-world and augmented sources. ChartAnchor introduces two complementary tasks: chart-to-code generation (synthesizing executable code to replicate charts) and controlled chart-to-table reconstruction (extracting exact data with predefined headers), enabling cross-validation of visual and numerical fidelity. A multi-level evaluation framework integrates semantic validation, stylistic analysis, and perceptual metrics to assess both structural and content-level correctness. Extensive experiments on MLLMs reveal critical limitations in numerical precision and code synthesis, emphasizing the need for structured reasoning beyond surface-level perception. By unifying symbolic and data-driven grounding, ChartAnchor establishes a rigorous foundation for chart grounding, offering meaningful insights for advancing MLLMs in scientific, financial, and industrial domains.
AR Oct 6, 2025 Code
DeepV: A Model-Agnostic Retrieval-Augmented Framework for Verilog Code Generation with a High-Quality Knowledge Base
Zahin Ibnat, Paul E. Calzada, Rasin Mohammed Ihtemam et al.
As large language models (LLMs) continue to be integrated into modern technology, there has been an increased push towards code generation applications, which also naturally extends to hardware design automation. LLM-based solutions for register transfer level (RTL) code generation for intellectual property (IP) designs have grown, especially with fine-tuned LLMs, prompt engineering, and agentic approaches becoming popular in literature. However, a gap has been exposed in these techniques, as they fail to integrate novel IPs into the model's knowledge base, subsequently resulting in poorly generated code. Additionally, as general-purpose LLMs continue to improve, fine-tuned methods on older models will not be able to compete to produce more accurate and efficient designs. Although some retrieval augmented generation (RAG) techniques exist to mitigate challenges presented in fine-tuning approaches, works tend to leverage low-quality codebases, incorporate computationally expensive fine-tuning in the frameworks, or do not use RAG directly in the RTL generation step. In this work, we introduce DeepV: a model-agnostic RAG framework to generate RTL designs by enhancing context through a large, high-quality dataset without any RTL-specific training. Our framework benefits the latest commercial LLM, OpenAI's GPT-5, with a near 17% increase in performance on the VerilogEval benchmark. We host DeepV for use by the community in a Hugging Face (HF) Space: https://huggingface.co/spaces/FICS-LLM/DeepV.
LG May 9, 2023 Code
Deep Graph Neural Networks via Posteriori-Sampling-based Node-Adaptive Residual Module
Jingbo Zhou, Yixuan Du, Ruqiong Zhang et al.
Graph Neural Networks (GNNs), a type of neural network that can learn from graph-structured data through neighborhood information aggregation, have shown superior performance in various downstream tasks. However, as the number of layers increases, node representations become indistinguishable, which is known as over-smoothing. To address this issue, many residual methods have emerged. In this paper, we focus on the over-smoothing issue and related residual methods. Firstly, we revisit over-smoothing from the perspective of overlapping neighborhood subgraphs, and based on this, we explain how residual methods can alleviate over-smoothing by integrating neighborhood subgraphs of multiple orders to avoid the indistinguishability of a single high-order neighborhood subgraph. Additionally, we reveal the drawbacks of previous residual methods, such as the lack of node adaptability and severe loss of high-order neighborhood subgraph information, and propose a Posterior-Sampling-based, Node-Adaptive Residual module (PSNR). We theoretically demonstrate that PSNR can alleviate the drawbacks of previous residual methods. Furthermore, extensive experiments verify the superiority of the PSNR module in fully observed node classification and missing feature scenarios. Our code is available at https://github.com/jingbo02/PSNR-GNN.
DB Mar 28, 2016 Code
A Generic Inverted Index Framework for Similarity Search on the GPU - Technical Report
Jingbo Zhou, Qi Guo, H. V. Jagadish et al.
We propose a novel generic inverted index framework on the GPU (called GENIE), aiming to reduce the programming complexity of the GPU for parallel similarity search of different data types. Not every data type and similarity measure are supported by GENIE, but many popular ones are. We present the system design of GENIE, and demonstrate similarity search with GENIE on several data types along with a theoretical analysis of search results. A new concept of locality sensitive hashing (LSH) named $τ$-ANN search, and a novel data structure c-PQ on the GPU are also proposed for achieving this purpose. Extensive experiments on different real-life datasets demonstrate the efficiency and effectiveness of our framework. The implemented system has been released as open source.
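The idea of answering similarity queries through an inverted index can be sketched on the CPU. The abstract does not describe the scoring rule, so the example below assumes a simple match-count score (how many query items an object shares). It is not GENIE's GPU implementation, and the function names are hypothetical.

```python
from collections import defaultdict
import heapq


def build_inverted_index(objects):
    """Map each item (e.g. a token or hash value) to the ids of objects containing it."""
    index = defaultdict(list)
    for obj_id, items in enumerate(objects):
        for item in set(items):
            index[item].append(obj_id)
    return index


def top_k_by_match_count(index, query_items, k):
    """Return up to k (object id, match count) pairs with the highest counts.

    Ties are broken in favour of the smaller object id.
    """
    counts = defaultdict(int)
    for item in set(query_items):
        for obj_id in index.get(item, ()):
            counts[obj_id] += 1
    return heapq.nlargest(k, counts.items(), key=lambda pair: (pair[1], -pair[0]))
```

Only the postings lists of the query's items are scanned, so objects that share nothing with the query are never touched.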
CR Mar 28, 2024
Enhancing Trust and Privacy in Distributed Networks: A Comprehensive Survey on Blockchain-based Federated Learning
Ji Liu, Chunlu Chen, Yu Li et al.
While centralized servers pose a risk of being a single point of failure, decentralized approaches like blockchain offer a compelling solution by implementing a consensus mechanism among multiple entities. Merging distributed computing with cryptographic techniques, decentralized technologies introduce a novel computing paradigm. Blockchain ensures secure, transparent, and tamper-proof data management by validating and recording transactions via consensus across network nodes. Federated Learning (FL), as a distributed machine learning framework, enables participants to collaboratively train models while safeguarding data privacy by avoiding direct raw data exchange. Despite the growing interest in decentralized methods, their application in FL remains underexplored. This paper presents a thorough investigation into Blockchain-based FL (BCFL), spotlighting the synergy between blockchain's security features and FL's privacy-preserving model training capabilities. First, we present the taxonomy of BCFL from three aspects, including decentralized, separate networks, and reputation-based architectures. Then, we summarize the general architecture of BCFL systems, providing a comprehensive perspective on FL architectures informed by blockchain. Afterward, we analyze the application of BCFL in healthcare, IoT, and other privacy-sensitive areas. Finally, we identify future research directions of BCFL.
QM Mar 9, 2024
AdaNovo: Adaptive De Novo Peptide Sequencing with Conditional Mutual Information
Jun Xia, Shaorong Chen, Jingbo Zhou et al.
Tandem mass spectrometry has played a pivotal role in advancing proteomics, enabling the analysis of protein composition in biological samples. Despite the development of various deep learning methods for identifying amino acid sequences (peptides) responsible for observed spectra, challenges persist in de novo peptide sequencing. Firstly, prior methods struggle to identify amino acids with post-translational modifications (PTMs) due to their lower frequency in training data compared to canonical amino acids, further resulting in decreased peptide-level identification precision. Secondly, diverse types of noise and missing peaks in mass spectra reduce the reliability of training data (peptide-spectrum matches, PSMs). To address these challenges, we propose AdaNovo, a novel framework that calculates conditional mutual information (CMI) between the spectrum and each amino acid/peptide, using CMI for adaptive model training. Extensive experiments demonstrate AdaNovo's state-of-the-art performance on a 9-species benchmark, where the peptides in the training set are almost completely disjoint from the peptides of the test sets. Moreover, AdaNovo excels in identifying amino acids with PTMs and exhibits robustness against data noise. The supplementary materials contain the official code.
IR Mar 13, 2025
ImageScope: Unifying Language-Guided Image Retrieval via Large Multimodal Model Collective Reasoning
Pengfei Luo, Jingbo Zhou, Tong Xu et al.
With the proliferation of images in online content, language-guided image retrieval (LGIR) has emerged as a research hotspot over the past decade, encompassing a variety of subtasks with diverse input forms. While the development of large multimodal models (LMMs) has significantly facilitated these tasks, existing approaches often address them in isolation, requiring the construction of separate systems for each task. This not only increases system complexity and maintenance costs, but also exacerbates challenges stemming from language ambiguity and complex image content, making it difficult for retrieval systems to provide accurate and reliable results. To this end, we propose ImageScope, a training-free, three-stage framework that leverages collective reasoning to unify LGIR tasks. The key insight behind the unification lies in the compositional nature of language, which transforms diverse LGIR tasks into a generalized text-to-image retrieval process, along with the reasoning of LMMs serving as a universal verification to refine the results. To be specific, in the first stage, we improve the robustness of the framework by synthesizing search intents across varying levels of semantic granularity using chain-of-thought (CoT) reasoning. In the second and third stages, we then reflect on retrieval results by verifying predicate propositions locally, and performing pairwise evaluations globally. Experiments conducted on six LGIR datasets demonstrate that ImageScope outperforms competitive baselines. Comprehensive evaluations and ablation studies further confirm the effectiveness of our design.
DC Jul 14, 2025
Efficient Federated Learning with Heterogeneous Data and Adaptive Dropout
Ji Liu, Beichen Ma, Qiaolin Yu et al.
Federated Learning (FL) is a promising distributed machine learning approach that enables collaborative training of a global model using multiple edge devices. The data distributed among the edge devices is highly heterogeneous. Thus, FL faces the challenge of data distribution and heterogeneity, where non-Independent and Identically Distributed (non-IID) data across edge devices may yield a significant accuracy drop. Furthermore, the limited computation and communication capabilities of edge devices increase the likelihood of stragglers, thus leading to slow model convergence. In this paper, we propose the FedDHAD FL framework, which comes with two novel methods: Dynamic Heterogeneous model aggregation (FedDH) and Adaptive Dropout (FedAD). FedDH dynamically adjusts the weights of each local model within the model aggregation process based on the non-IID degree of heterogeneous data to deal with the statistical data heterogeneity. FedAD performs neuron-adaptive operations in response to heterogeneous devices to improve accuracy while achieving superb efficiency. The combination of these two methods makes FedDHAD significantly outperform state-of-the-art solutions in terms of accuracy (up to 6.7% higher), efficiency (up to 2.02 times faster), and computation cost (up to 15.0% smaller).
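Heterogeneity-aware aggregation of the kind FedDH is described as performing can be illustrated with a toy weighted average. The abstract does not give the weighting rule, so the formula below (sample count divided by one plus a non-IID degree) is purely an assumption for illustration and is not FedDH's actual update. All names are hypothetical.

```python
def aggregate(local_models, num_samples, noniid_degree):
    """Weighted average of local parameter vectors.

    Assumed rule: weight_k is proportional to n_k / (1 + d_k), so clients
    whose data looks more non-IID (larger d_k) contribute less.
    """
    raw = [n / (1.0 + d) for n, d in zip(num_samples, noniid_degree)]
    total = sum(raw)
    weights = [r / total for r in raw]
    dim = len(local_models[0])
    return [
        sum(w * model[i] for w, model in zip(weights, local_models))
        for i in range(dim)
    ]
```

With all non-IID degrees set to zero, this reduces to the usual sample-size-weighted average used in FedAvg.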
CR Jun 25, 2025
SV-LLM: An Agentic Approach for SoC Security Verification using Large Language Models
Dipayan Saha, Shams Tarek, Hasan Al Shaikh et al.
Ensuring the security of complex system-on-chip (SoC) designs is a critical imperative, yet traditional verification techniques struggle to keep pace due to significant challenges in automation, scalability, comprehensiveness, and adaptability. The advent of large language models (LLMs), with their remarkable capabilities in natural language understanding, code generation, and advanced reasoning, presents a new paradigm for tackling these issues. Moving beyond monolithic models, an agentic approach allows for the creation of multi-agent systems where specialized LLMs collaborate to solve complex problems more effectively. Recognizing this opportunity, we introduce SV-LLM, a novel multi-agent assistant system designed to automate and enhance SoC security verification. By integrating specialized agents for tasks like verification question answering, security asset identification, threat modeling, test plan and property generation, vulnerability detection, and simulation-based bug validation, SV-LLM streamlines the workflow. To optimize their performance in these diverse tasks, agents leverage different learning paradigms, such as in-context learning, fine-tuning, and retrieval-augmented generation (RAG). The system aims to reduce manual intervention, improve accuracy, and accelerate security analysis, supporting proactive identification and mitigation of risks early in the design cycle. We demonstrate its potential to transform hardware security practices through illustrative case studies and experiments that showcase its applicability and efficacy.
LG May 20, 2025
AAPO: Enhancing the Reasoning Capabilities of LLMs with Advantage Momentum
Jian Xiong, Jingbo Zhou, Jingyong Ye et al.
Reinforcement learning (RL) has emerged as an effective approach for enhancing the reasoning capabilities of large language models (LLMs), especially in scenarios where supervised fine-tuning (SFT) falls short due to limited chain-of-thought (CoT) data. Among RL-based post-training methods, group relative advantage estimation, as exemplified by Group Relative Policy Optimization (GRPO), has attracted considerable attention for eliminating the dependency on the value model, thereby simplifying training compared to traditional approaches like Proximal Policy Optimization (PPO). However, we observe that existing group relative advantage estimation methods still suffer from training inefficiencies, particularly when the estimated advantage approaches zero. To address this limitation, we propose Advantage-Augmented Policy Optimization (AAPO), a novel RL algorithm that optimizes the cross-entropy (CE) loss using advantages enhanced through a momentum-based estimation scheme. This approach effectively mitigates the inefficiencies associated with group relative advantage estimation. Experimental results on multiple mathematical reasoning benchmarks demonstrate the superior performance of AAPO.
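The near-zero-advantage problem mentioned above is easy to see in the group-relative estimate itself. The sketch below shows only the GRPO-style normalization (reward minus group mean, divided by group standard deviation). It does not implement AAPO's momentum term, which the abstract does not specify. The function name, the use of the population standard deviation, and the `eps` value are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# If every sampled response receives the same reward, all advantages are zero
# and the group contributes no learning signal.
uniform_group = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
mixed_group = group_relative_advantages([1.0, 0.0])
```

Here `uniform_group` is all zeros, while `mixed_group` has one positive and one negative entry of equal magnitude.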
LG Mar 6, 2025
TimeFound: A Foundation Model for Time Series Forecasting
Congxi Xiao, Jingbo Zhou, Yixiong Xiao et al.
We present TimeFound, an encoder-decoder transformer-based time series foundation model for out-of-the-box zero-shot forecasting. To handle time series data from various domains, TimeFound employs a multi-resolution patching strategy to capture complex temporal patterns at multiple scales. We pre-train our model with two sizes (200M and 710M parameters) on a large time-series corpus comprising both real-world and synthetic datasets. Over a collection of unseen datasets across diverse domains and forecasting horizons, our empirical evaluations suggest that TimeFound can achieve superior or competitive zero-shot forecasting performance, compared to state-of-the-art time series foundation models.
QM May 6, 2025
GRAPE: Heterogeneous Graph Representation Learning for Genetic Perturbation with Coding and Non-Coding Biotype
Changxi Chi, Jun Xia, Jingbo Zhou et al.
Predicting genetic perturbations enables the identification of potentially crucial genes prior to wet-lab experiments, significantly improving overall experimental efficiency. Since genes are the foundation of cellular life, building gene regulatory networks (GRNs) is essential to understand and predict the effects of genetic perturbations. However, current methods fail to fully leverage gene-related information and rely solely on simple evaluation metrics to construct coarse-grained GRNs. More importantly, they ignore functional differences between biotypes, limiting the ability to capture potential gene interactions. In this work, we leverage a pre-trained large language model and a DNA sequence model to extract features from gene descriptions and DNA sequence data, respectively, which serve as the initialization for gene representations. Additionally, we introduce gene biotype information for the first time in genetic perturbation, simulating the distinct roles of genes with different biotypes in regulating cellular processes, while capturing implicit gene relationships through graph structure learning (GSL). We propose GRAPE, a heterogeneous graph neural network (HGNN) that leverages gene representations initialized with features from descriptions and sequences, models the distinct roles of genes with different biotypes, and dynamically refines the GRN through GSL. The results on publicly available datasets show that our method achieves state-of-the-art performance.
LG Jun 26, 2025
Unlasting: Unpaired Single-Cell Multi-Perturbation Estimation by Dual Conditional Diffusion Implicit Bridges
Changxi Chi, Jun Xia, Yufei Huang et al.
Estimating single-cell responses across various perturbations facilitates the identification of key genes and enhances drug screening, significantly boosting experimental efficiency. However, single-cell sequencing is a destructive process, making it impossible to capture the same cell's phenotype before and after perturbation. Consequently, data collected under perturbed and unperturbed conditions are inherently unpaired. Existing methods either attempt to forcibly pair unpaired data using random sampling, or neglect the inherent relationship between unperturbed and perturbed cells during the modeling. In this work, we propose a framework based on Dual Diffusion Implicit Bridges (DDIB) to learn the mapping between different data distributions, effectively addressing the challenge of unpaired data. We further interpret this framework as a form of data augmentation. We integrate gene regulatory network (GRN) information to propagate perturbation signals in a biologically meaningful way, and further incorporate a masking mechanism to predict silent genes, improving the quality of generated profiles. Moreover, gene expression under the same perturbation often varies significantly across cells, frequently exhibiting a bimodal distribution that reflects intrinsic heterogeneity. To capture this, we introduce a more suitable evaluation metric. We propose Unlasting, dual conditional diffusion models that overcome the problem of unpaired single-cell perturbation data and strengthen the model's insight into perturbations under the guidance of the GRN, with a dedicated mask model designed to improve generation quality by predicting silent genes. In addition, we introduce a biologically grounded evaluation metric that better reflects the inherent heterogeneity in single-cell responses.
LG Mar 5, 2025
NodeReg: Mitigating the Imbalance and Distribution Shift Effects in Semi-Supervised Node Classification via Norm Consistency
Shenzhi Yang, Jun Xia, Jingbo Zhou et al.
Aggregating information from neighboring nodes benefits graph neural networks (GNNs) in semi-supervised node classification tasks. Nevertheless, this mechanism also renders nodes susceptible to the influence of their neighbors. For instance, this will occur when the neighboring nodes are imbalanced or the neighboring nodes contain noise, which can even affect the GNN's ability to generalize out of distribution. We find that ensuring the consistency of the norm for node representations can significantly reduce the impact of these two issues on GNNs. To this end, we propose a regularized optimization method called NodeReg that enforces the consistency of node representation norms. This method is simple but effective and satisfies Lipschitz continuity, thus facilitating stable optimization and significantly improving semi-supervised node classification performance under the above two scenarios. To illustrate, in the imbalance scenario, when training a GCN with an imbalance ratio of 0.1, NodeReg outperforms the most competitive baselines by 1.4%-25.9% in F1 score across five public datasets. Similarly, in the distribution shift scenario, NodeReg outperforms the most competitive baseline by 1.4%-3.1% in accuracy.
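One way to picture a norm-consistency regularizer is as a penalty on how much the L2 norms of node representations differ from one another. The abstract does not give NodeReg's exact loss, so the form below (mean squared deviation of each norm from the average norm) is an assumption for illustration only, and the function name is hypothetical.

```python
import math


def norm_consistency_penalty(embeddings):
    """Mean squared deviation of each node's L2 norm from the average norm.

    The penalty is zero exactly when all node representations have equal norm.
    """
    norms = [math.sqrt(sum(x * x for x in row)) for row in embeddings]
    mean_norm = sum(norms) / len(norms)
    return sum((n - mean_norm) ** 2 for n in norms) / len(norms)
```

In training, a term like this would be added to the classification loss with a small coefficient, so that nodes with unusually large or small norms are pulled toward the common scale.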
GN Nov 17, 2025
MergeDNA: Context-aware Genome Modeling with Dynamic Tokenization through Token Merging
Siyuan Li, Kai Yu, Anna Wang et al.
Modeling genomic sequences faces two unsolved challenges: the information density varies widely across different regions, while there is no clearly defined minimum vocabulary unit. Relying on either four primitive bases or independently designed DNA tokenizers, existing approaches with naive masked language modeling pre-training often fail to adapt to the varying complexities of genomic sequences. Leveraging Token Merging techniques, this paper introduces a hierarchical architecture that jointly optimizes a dynamic genomic tokenizer and latent Transformers with context-aware pre-training tasks. As for network structures, the tokenization module automatically chunks adjacent bases into words by stacking multiple layers of the differentiable token merging blocks with local-window constraints, then a Latent Encoder captures the global context of these merged words by full-attention blocks. Symmetrically employing a Latent Decoder and a Local Decoder, MergeDNA learns with two pre-training tasks: Merged Token Reconstruction simultaneously trains the dynamic tokenization module and adaptively filters important tokens, while Adaptive Masked Token Modeling learns to predict these filtered tokens to capture informative contents. Extensive experiments show that MergeDNA achieves superior performance on three popular DNA benchmarks and several multi-omics tasks with fine-tuning or zero-shot evaluation, outperforming typical tokenization methods and large-scale DNA foundation models.
LG Oct 9, 2025
PRESCRIBE: Predicting Single-Cell Responses with Bayesian Estimation
Jiabei Cheng, Changxi Chi, Jingbo Zhou et al.
In single-cell perturbation prediction, a central task is to forecast the effects of perturbing a gene unseen in the training data. The efficacy of such predictions depends on two factors: (1) the similarity of the target gene to those covered in the training data, which informs model (epistemic) uncertainty, and (2) the quality of the corresponding training data, which reflects data (aleatoric) uncertainty. Both factors are critical for determining the reliability of a prediction, particularly as gene perturbation is an inherently stochastic biochemical process. In this paper, we propose PRESCRIBE (PREdicting Single-Cell Response wIth Bayesian Estimation), a multivariate deep evidential regression framework designed to measure both sources of uncertainty jointly. Our analysis demonstrates that PRESCRIBE effectively estimates a confidence score for each prediction, which strongly correlates with its empirical accuracy. This capability enables the filtering of untrustworthy results, and in our experiments, it achieves steady accuracy improvements of over 3% compared to comparable baselines.
CR Jul 22, 2025
SVAgent: AI Agent for Hardware Security Verification Assertion
Rui Guo, Avinash Ayalasomayajula, Henian Li et al.
Verification using SystemVerilog assertions (SVA) is one of the most popular methods for detecting circuit design vulnerabilities. However, with the globalization of integrated circuit design and the continuous upgrading of security requirements, the SVA development model has exposed major limitations: it is inefficient, and it cannot effectively deal with the increasing number of security vulnerabilities in modern complex integrated circuits. In response to these challenges, this paper proposes SVAgent, an automatic SVA generation framework. SVAgent introduces a requirement decomposition mechanism that transforms the original complex requirements into a structured, fine-grained problem-solving chain that can be solved step by step. Experiments show that SVAgent effectively suppresses the influence of hallucinations and random answers, and that key evaluation indicators, such as the accuracy and consistency of the SVA, are significantly better than those of existing frameworks. More importantly, we successfully integrated SVAgent into the most mainstream integrated circuit vulnerability assessment framework and verified its practicality and reliability in a real engineering design environment.
CE Jan 6, 2025
CHAT: Beyond Contrastive Graph Transformer for Link Prediction in Heterogeneous Networks
Shengming Zhang, Le Zhang, Jingbo Zhou et al.
Link prediction in heterogeneous networks is crucial for understanding the intricacies of network structures and forecasting their future developments. Traditional methodologies often face significant obstacles, including over-smoothing, in which the excessive aggregation of node features leads to the loss of critical structural details, and a dependency on human-defined meta-paths, which necessitate extensive domain knowledge and can be inherently restrictive. These limitations hinder the effective prediction and analysis of complex heterogeneous networks. In response to these challenges, we propose the Contrastive Heterogeneous grAph Transformer (CHAT). CHAT introduces a novel sampling-based graph transformer technique that selectively retains nodes of interest, thereby obviating the need for predefined meta-paths. The method employs an innovative connection-aware transformer to encode node sequences and their interconnections with high fidelity, guided by a dual-faceted loss function specifically designed for heterogeneous network link prediction. Additionally, CHAT incorporates an ensemble link predictor that synthesizes multiple samplings to achieve enhanced prediction accuracy. We conducted comprehensive evaluations of CHAT using three distinct drug-target interaction (DTI) datasets. The empirical results underscore CHAT's superior performance, outperforming both general-task approaches and models specialized in DTI prediction. These findings substantiate the efficacy of CHAT in addressing the complex problem of link prediction in heterogeneous networks.
QM Jun 16, 2024
NovoBench: Benchmarking Deep Learning-based De Novo Peptide Sequencing Methods in Proteomics
Jingbo Zhou, Shaorong Chen, Jun Xia et al.
Tandem mass spectrometry has played a pivotal role in advancing proteomics, enabling the high-throughput analysis of protein composition in biological tissues. Many deep learning methods have been developed for the de novo peptide sequencing task, i.e., predicting the peptide sequence for the observed mass spectrum. However, two key challenges seriously hinder the further advancement of this important task. Firstly, since there is no consensus on the evaluation datasets, the empirical results in different research papers are often not comparable, leading to unfair comparison. Secondly, the current methods are usually limited to amino acid-level or peptide-level precision and recall metrics. In this work, we present the first unified benchmark, NovoBench, for de novo peptide sequencing, which comprises diverse mass spectrum data, integrated models, and comprehensive evaluation metrics. Recent impressive methods, including DeepNovo, PointNovo, Casanovo, InstaNovo, AdaNovo and π-HelixNovo, are integrated into our framework. In addition to amino acid-level and peptide-level precision and recall, we evaluate the models' performance in terms of identifying post-translational modifications (PTMs), efficiency, and robustness to peptide length, noise peaks and missing fragment ratio, which are important influencing factors yet are seldom considered. Leveraging this benchmark, we conduct a large-scale study of current methods and report many insightful findings that open up new possibilities for future development.
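The amino acid-level and peptide-level precision and recall mentioned above can be illustrated with a simplified sketch. It assumes a residue counts as matched when it equals the ground-truth residue at the same position; published evaluations typically match residues by mass tolerance instead, so this is not NovoBench's implementation. The `evaluate` function and its dictionary keys are hypothetical names.

```python
from typing import Dict, List, Sequence

Peptide = Sequence[str]  # one token per residue, so PTM tokens can be multi-character


def aa_matches(pred: Peptide, true: Peptide) -> int:
    """Count residues that agree position by position (simplifying assumption)."""
    return sum(p == t for p, t in zip(pred, true))


def _ratio(num: int, den: int) -> float:
    """Safe division that returns 0.0 for an empty denominator."""
    return num / den if den else 0.0


def evaluate(preds: List[Peptide], truths: List[Peptide]) -> Dict[str, float]:
    """Amino acid-level and peptide-level precision and recall over paired spectra."""
    matched = sum(aa_matches(p, t) for p, t in zip(preds, truths))
    n_pred_aa = sum(len(p) for p in preds)
    n_true_aa = sum(len(t) for t in truths)
    # An empty prediction means the model returned nothing for that spectrum.
    n_predicted = sum(1 for p in preds if len(p) > 0)
    n_correct = sum(1 for p, t in zip(preds, truths) if len(p) > 0 and list(p) == list(t))
    return {
        "aa_precision": _ratio(matched, n_pred_aa),
        "aa_recall": _ratio(matched, n_true_aa),
        "peptide_precision": _ratio(n_correct, n_predicted),
        "peptide_recall": _ratio(n_correct, len(truths)),
    }


if __name__ == "__main__":
    preds = [["P", "E", "P"], ["A", "C"], []]
    truths = [["P", "E", "P"], ["A", "G", "K"], ["L", "K"]]
    print(evaluate(preds, truths))
```

Representing peptides as token lists rather than plain strings keeps a modified residue as a single unit, which matters when PTM identification is scored separately.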
CR Mar 12, 2024
A Framework for Cost-Effective and Self-Adaptive LLM Shaking and Recovery Mechanism
Zhiyu Chen, Yu Li, Suochao Zhang et al.
As Large Language Models (LLMs) gain great success in real-world applications, an increasing number of users are seeking to develop and deploy their customized LLMs through cloud services. Nonetheless, in some specific domains, there are still concerns regarding cost and the trade-off between privacy and accuracy. In this study, we introduce a cost-effective and self-adaptive LLM shaking tuning and recovery mechanism, named CypherTalk. With carefully designed horizontal and vertical shaking operators, we achieve accuracy comparable to SOTA privacy-preserving LLM schemes that use cryptography-based or differential privacy-based methods. Experiments also show that with the CypherTalk framework, users can achieve reliable accuracy when using optimized shaking operator settings. To the best of our knowledge, this is the first work that considers cost and the trade-off between model utility and privacy in LLM scenarios.
CL May 9, 2023
PLM-GNN: A Webpage Classification Method based on Joint Pre-trained Language Model and Graph Neural Network
Qiwei Lang, Jingbo Zhou, Haoyi Wang et al.
The number of web pages is growing at an exponential rate, accumulating massive amounts of data on the web. Classifying webpages is one of the key processes in web information mining. Some classical methods are based on manually building features of web pages and training classifiers based on machine learning or deep learning. However, building features manually requires specific domain knowledge, and validating the features usually takes a long time. Considering that webpages are generated by combining text and HTML Document Object Model (DOM) trees, we propose a representation and classification method based on a pre-trained language model and graph neural network, named PLM-GNN. It is based on the joint encoding of text and HTML DOM trees in the web pages. It performs well on the KI-04 and SWDE datasets and on the practical dataset AHS from a scholar homepage crawling project.
DC Dec 11, 2021
Efficient Device Scheduling with Multi-Job Federated Learning
Chendi Zhou, Ji Liu, Juncheng Jia et al.
Recent years have witnessed a large amount of decentralized data in multiple (edge) devices of end-users, while the aggregation of the decentralized data remains difficult for machine learning jobs due to laws or regulations. Federated Learning (FL) emerges as an effective approach to handling decentralized data without sharing the sensitive raw data, while collaboratively training global machine learning models. The servers in FL need to select (and schedule) devices during the training process. However, the scheduling of devices for multiple jobs with FL remains a critical and open problem. In this paper, we propose a novel multi-job FL framework to enable the parallel training process of multiple jobs. The framework consists of a system model and two scheduling methods. In the system model, we propose a parallel training process of multiple jobs, and construct a cost model based on the training time and the data fairness of various devices during the training process of diverse jobs. We propose a reinforcement learning-based method and a Bayesian optimization-based method to schedule devices for multiple jobs while minimizing the cost. We conduct extensive experimentation with multiple jobs and datasets. The experimental results show that our proposed approaches significantly outperform baseline approaches in terms of training time (up to 8.67 times faster) and accuracy (up to 44.6% higher).
LG Sep 24, 2021
Adversarial Neural Trip Recommendation
Linlang Jiang, Jingbo Zhou, Tong Xu et al.
Trip recommender systems, which aim to recommend a trip consisting of several ordered Points of Interest (POIs), have long been treated as an important application for many location-based services. Currently, most prior work generates trips following pre-defined objectives based on constraint programming, which may fail to reflect the complex latent patterns hidden in human mobility data. Moreover, most of these methods struggle to respond in real time when the number of POIs is large. To that end, we propose an Adversarial Neural Trip Recommendation (ANT) framework to tackle the above challenges. First of all, we devise a novel attention-based encoder-decoder trip generator that can learn the correlations among POIs and generate well-designed trips under given constraints. Another novelty of ANT lies in an adversarial learning strategy integrated with reinforcement learning to guide the trip generator to produce high-quality trips. For this purpose, we introduce a discriminator, which distinguishes the generated trips from real-life trips taken by users, to provide reward signals to optimize the generator. Moreover, we devise a novel pre-training scheme based on learning from demonstration, which speeds up convergence to achieve a sufficient and efficient training process. Extensive experiments on four real-world datasets validate the effectiveness and efficiency of our proposed ANT framework, demonstrating that ANT can remarkably outperform the state-of-the-art baselines with a short response time.
LG Sep 24, 2021
GeomGCL: Geometric Graph Contrastive Learning for Molecular Property Prediction
Shuangli Li, Jingbo Zhou, Tong Xu et al.
Recently, many efforts have been devoted to applying graph neural networks (GNNs) to molecular property prediction, which is a fundamental task for computational drug and material discovery. One of the major obstacles hindering the successful prediction of molecular properties by GNNs is the scarcity of labeled data. Though graph contrastive learning (GCL) methods have achieved extraordinary performance with insufficient labeled data, most have focused on designing data augmentation schemes for general graphs. However, the fundamental properties of a molecule could be altered by augmentation methods (such as random perturbation) applied to molecular graphs. Meanwhile, the critical geometric information of molecules remains rarely explored under current GNN and GCL architectures. To this end, we propose a novel graph contrastive learning method utilizing the geometry of the molecule across 2D and 3D views, which is named GeomGCL. Specifically, we first devise a dual-view geometric message passing network (GeomMPNN) to adaptively leverage the rich information of both 2D and 3D graphs of a molecule. The incorporation of geometric properties at different levels can greatly facilitate molecular representation learning. Then a novel geometric graph contrastive scheme is designed to make both geometric views collaboratively supervise each other to improve the generalization ability of GeomMPNN. We evaluate GeomGCL on various downstream property prediction tasks via a fine-tuning process. Experimental results on seven real-life molecular datasets demonstrate the effectiveness of our proposed GeomGCL against state-of-the-art baselines.
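The idea of two views supervising each other can be illustrated with a generic cross-view InfoNCE loss, in which the 2D embedding of a molecule is pulled toward its own 3D embedding and pushed away from the 3D embeddings of other molecules in the batch. The abstract does not specify GeomGCL's actual objective, so this is a stand-in sketch; `info_nce`, the cosine similarity, and the temperature value are assumptions, and embeddings are assumed to be nonzero vectors.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(view_2d: List[Vector], view_3d: List[Vector], temperature: float = 0.5) -> float:
    """Mean InfoNCE loss where row i of each view describes the same molecule."""
    n = len(view_2d)
    total = 0.0
    for i in range(n):
        logits = [cosine(view_2d[i], view_3d[j]) / temperature for j in range(n)]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]  # negative log-probability of the true pair
    return total / n


if __name__ == "__main__":
    emb_2d = [[1.0, 0.0], [0.0, 1.0]]
    aligned_3d = [[1.0, 0.0], [0.0, 1.0]]
    shuffled_3d = [[0.0, 1.0], [1.0, 0.0]]
    print(info_nce(emb_2d, aligned_3d) < info_nce(emb_2d, shuffled_3d))
```

The loss is lower when each molecule's two views agree, which is the behaviour a cross-view contrastive objective is meant to reward.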
QM Jul 21, 2021
Structure-aware Interactive Graph Neural Networks for the Prediction of Protein-Ligand Binding Affinity
Shuangli Li, Jingbo Zhou, Tong Xu et al.
Drug discovery often relies on the successful prediction of protein-ligand binding affinity. Recent advances have shown great promise in applying graph neural networks (GNNs) for better affinity prediction by learning the representations of protein-ligand complexes. However, existing solutions usually treat protein-ligand complexes as topological graph data, so the biomolecular structural information is not fully utilized. The essential long-range interactions among atoms are also neglected in GNN models. To this end, we propose a structure-aware interactive graph neural network (SIGN) which consists of two components: polar-inspired graph attention layers (PGAL) and pairwise interactive pooling (PiPool). Specifically, PGAL iteratively performs the node-edge aggregation process to update embeddings of nodes and edges while preserving the distance and angle information among atoms. Then, PiPool is adopted to gather interactive edges with a subsequent reconstruction loss to reflect the global interactions. An extensive experimental study on two benchmarks verifies the superiority of SIGN.