Ping Xu

LG
h-index38
17papers
249citations
Novelty41%
AI Score53

17 Papers

LGMay 26
Probabilistic Smoothing with Ratio-Monotone Transforms for Global Optimization

Kukyoung Jang, Taehyun Cho, Junrui Zhang et al.

Probabilistic smoothing is a standard tool for global optimization, but existing methods rely on Gaussian kernels and specific transforms, often resulting in strong hyperparameter sensitivity and limited robustness. We propose a general smoothing framework that combines flexible symmetric unimodal kernels with monotonic ratio-based transformations. Under mild conditions, we show that the smoothed objective preserves the global maximizer and that all stationary points concentrate near the true optimum for sufficiently large amplification, without requiring a decreasing smoothing schedule. We further provide explicit complexity bounds for stochastic gradient ascent and show that a leave-one-out baseline provably reduces variance. Experiments on high-dimensional benchmarks and black-box adversarial attacks demonstrate improved robustness and competitive performance.

DBOct 18, 2023
A Comprehensive Survey on Vector Database: Storage and Retrieval Technique, Challenge

Le Ma, Ran Zhang, Yikun Han et al.

Vector databases (VDBs) have emerged to manage high-dimensional data that exceed the capabilities of traditional database management systems, and are now tightly integrated with large language models as well as widely applied in modern artificial intelligence systems. Although relatively few studies describe existing or introduce new vector database architectures, the core technologies underlying VDBs, such as approximate nearest neighbor search, have been extensively studied and are well documented in the literature. In this work, we present a comprehensive review of the relevant algorithms to provide a general understanding of this booming research area. Specifically, we first provide a review of storage and retrieval techniques in VDBs, with detailed design principles and technological evolution. Then, we conduct an in-depth comparison of several advanced VDB solutions with their strengths, limitations, and typical application scenarios. Finally, we also outline emerging opportunities for coupling VDBs with large language models, including open research problems and trends, such as novel indexing strategies. This survey aims to serve as a practical resource, enabling readers to quickly gain an overall understanding of the current knowledge landscape in this rapidly developing area.

LGOct 29, 2022
Robust Distributed Learning Against Both Distributional Shifts and Byzantine Attacks

Guanqiang Zhou, Ping Xu, Yue Wang et al.

In distributed learning systems, robustness issues may arise from two sources. On one hand, due to distributional shifts between training data and test data, the trained model could exhibit poor out-of-sample performance. On the other hand, a portion of working nodes might be subject to byzantine attacks which could invalidate the learning result. Existing works mostly deal with these two issues separately. In this paper, we propose a new algorithm that equips distributed learning with robustness measures against both distributional shifts and byzantine attacks. Our algorithm is built on recent advances in distributionally robust optimization as well as norm-based screening (NBS), a robust aggregation scheme against byzantine attacks. We provide convergence proofs in three cases of the learning model being nonconvex, convex, and strongly convex for the proposed algorithm, shedding light on its convergence behaviors and endurability against byzantine attacks. In particular, we deduce that any algorithm employing NBS (including ours) cannot converge when the percentage of byzantine nodes is 1/3 or higher, instead of 1/2, which is the common belief in current literature. The experimental results demonstrate the effectiveness of our algorithm against both robustness issues. To the best of our knowledge, this is the first work to address distributional shifts and byzantine attacks simultaneously.

LGAug 4, 2022
QC-ODKLA: Quantized and Communication-Censored Online Decentralized Kernel Learning via Linearized ADMM

Ping Xu, Yue Wang, Xiang Chen et al.

This paper focuses on online kernel learning over a decentralized network. Each agent in the network receives continuous streaming data locally and works collaboratively to learn a nonlinear prediction function that is globally optimal in the reproducing kernel Hilbert space with respect to the total instantaneous costs of all agents. In order to circumvent the curse of dimensionality issue in traditional online kernel learning, we utilize random feature (RF) mapping to convert the non-parametric kernel learning problem into a fixed-length parametric one in the RF space. We then propose a novel learning framework named Online Decentralized Kernel learning via Linearized ADMM (ODKLA) to efficiently solve the online decentralized kernel learning problem. To further improve the communication efficiency, we add the quantization and censoring strategies in the communication stage and develop the Quantized and Communication-censored ODKLA (QC-ODKLA) algorithm. We theoretically prove that both ODKLA and QC-ODKLA can achieve the optimal sublinear regret $\mathcal{O}(\sqrt{T})$ over $T$ time slots. Through numerical experiments, we evaluate the learning effectiveness, communication, and computation efficiencies of the proposed methods.

GNDec 2, 2025
scCluBench: Comprehensive Benchmarking of Clustering Algorithms for Single-Cell RNA Sequencing

Ping Xu, Zaitian Wang, Zhirui Wang et al.

Cell clustering is crucial for uncovering cellular heterogeneity in single-cell RNA sequencing (scRNA-seq) data by identifying cell types and marker genes. Despite its importance, benchmarks for scRNA-seq clustering methods remain fragmented, often lacking standardized protocols and failing to incorporate recent advances in artificial intelligence. To fill these gaps, we present scCluBench, a comprehensive benchmark of clustering algorithms for scRNA-seq data. First, scCluBench provides 36 scRNA-seq datasets collected from diverse public sources, covering multiple tissues, which are uniformly processed and standardized to ensure consistency for systematic evaluation and downstream analyses. To evaluate performance, we collect and reproduce a range of scRNA-seq clustering methods, including traditional, deep learning-based, graph-based, and biological foundation models. We comprehensively evaluate each method both quantitatively and qualitatively, using core performance metrics as well as visualization analyses. Furthermore, we construct representative downstream biological tasks, such as marker gene identification and cell type annotation, to further assess the practical utility. scCluBench then investigates the performance differences and applicability boundaries of various clustering models across diverse analytical tasks, systematically assessing their robustness and scalability in real-world scenarios. Overall, scCluBench offers a standardized and user-friendly benchmark for scRNA-seq clustering, with curated datasets, unified evaluation protocols, and transparent analyses, facilitating informed method selection and providing valuable insights into model generalizability and application scope.

LGApr 9, 2024Code
scCDCG: Efficient Deep Structural Clustering for single-cell RNA-seq via Deep Cut-informed Graph Embedding

Ping Xu, Zhiyuan Ning, Meng Xiao et al.

Single-cell RNA sequencing (scRNA-seq) is essential for unraveling cellular heterogeneity and diversity, offering invaluable insights for bioinformatics advancements. Despite its potential, traditional clustering methods in scRNA-seq data analysis often neglect the structural information embedded in gene expression profiles, crucial for understanding cellular correlations and dependencies. Existing strategies, including graph neural networks, face challenges in handling the inefficiency due to scRNA-seq data's intrinsic high-dimension and high-sparsity. Addressing these limitations, we introduce scCDCG (single-cell RNA-seq Clustering via Deep Cut-informed Graph), a novel framework designed for efficient and accurate clustering of scRNA-seq data that simultaneously utilizes intercellular high-order structural information. scCDCG comprises three main components: (i) A graph embedding module utilizing deep cut-informed techniques, which effectively captures intercellular high-order structural information, overcoming the over-smoothing and inefficiency issues prevalent in prior graph neural network methods. (ii) A self-supervised learning module guided by optimal transport, tailored to accommodate the unique complexities of scRNA-seq data, specifically its high-dimension and high-sparsity. (iii) An autoencoder-based feature learning module that simplifies model complexity through effective dimension reduction and feature extraction. Our extensive experiments on 6 datasets demonstrate scCDCG's superior performance and efficiency compared to 7 established models, underscoring scCDCG's potential as a transformative tool in scRNA-seq data analysis. Our code is available at: https://github.com/XPgogogo/scCDCG.

LGAug 12, 2024
Approximating Discrimination Within Models When Faced With Several Non-Binary Sensitive Attributes

Yijun Bian, Yujie Luo, Ping Xu

Discrimination mitigation within machine learning (ML) models could be complicated because multiple factors may be interwoven hierarchically and historically. Yet few existing fairness measures can capture the discrimination level within ML models in the face of multiple sensitive attributes (SAs). To bridge this gap, we propose a fairness measure based on distances between sets from a manifold perspective, named as 'Harmonic Fairness measure via Manifolds (HFM)' with two optional versions, which can deal with a fine-grained discrimination evaluation for several SAs of multiple values. Because directly computing HFM may be costly, to accelerate its subprocedure -- the computation of distances of sets, we further propose two approximation algorithms named 'Approximation of distance between sets for one sensitive attribute with multiple values (ApproxDist)' and 'Approximation of extended distance between sets for several sensitive attributes with multiple values (ExtendDist)' to respectively resolve bias evaluation of one single SA with multiple values and that of several SAs with multiple values. Moreover, we provide an algorithmic effectiveness analysis for ApproxDist under certain assumptions to explain how well it could work. The empirical results demonstrate that our proposed fairness measure HFM is valid and approximation algorithms (i.e. ApproxDist and ExtendDist) are effective and efficient.

CVJan 5
Real-Time Lane Detection via Efficient Feature Alignment and Covariance Optimization for Low-Power Embedded Systems

Yian Liu, Xiong Wang, Ping Xu et al.

Real-time lane detection in embedded systems encounters significant challenges due to subtle and sparse visual signals in RGB images, often constrained by limited computational resources and power consumption. Although deep learning models for lane detection categorized into segmentation-based, anchor-based, and curve-based methods there remains a scarcity of universally applicable optimization techniques tailored for low-power embedded environments. To overcome this, we propose an innovative Covariance Distribution Optimization (CDO) module specifically designed for efficient, real-time applications. The CDO module aligns lane feature distributions closely with ground-truth labels, significantly enhancing detection accuracy without increasing computational complexity. Evaluations were conducted on six diverse models across all three method categories, including two optimized for real-time applications and four state-of-the-art (SOTA) models, tested comprehensively on three major datasets: CULane, TuSimple, and LLAMAS. Experimental results demonstrate accuracy improvements ranging from 0.01% to 1.5%. The proposed CDO module is characterized by ease of integration into existing systems without structural modifications and utilizes existing model parameters to facilitate ongoing training, thus offering substantial benefits in performance, power efficiency, and operational flexibility in embedded systems.

LGDec 12, 2025
DFedReweighting: A Unified Framework for Objective-Oriented Reweighting in Decentralized Federated Learning

Kaichuang Zhang, Wei Yin, Jinghao Yang et al.

Decentralized federated learning (DFL) has recently emerged as a promising paradigm that enables multiple clients to collaboratively train machine learning model through iterative rounds of local training, communication, and aggregation without relying on a central server which introduces potential vulnerabilities in conventional Federated Learning. Nevertheless, DFL systems continue to face a range of challenges, including fairness, robustness, etc. To address these challenges, we propose \textbf{DFedReweighting}, a unified aggregation framework designed to achieve diverse objectives in DFL systems via a objective-oriented reweighting aggregation at the final step of each learning round. Specifically, the framework first computes preliminary weights based on \textit{target performance metric} obtained from auxiliary dataset constructed using local data. These weights are then refined using \textit{customized reweighting strategy}, resulting in the final aggregation weights. Our results from the theoretical analysis demonstrate that the appropriate combination of the target performance metric and the customized reweighting strategy ensures linear convergence. Experimental results consistently show that our proposed framework significantly improves fairness and robustness against Byzantine attacks in diverse scenarios. Provided that appropriate target performance metrics and customized reweighting strategy are selected, our framework can achieve a wide range of desired learning objectives.

LGMar 9, 2025
Deep Cut-informed Graph Embedding and Clustering

Zhiyuan Ning, Zaitian Wang, Ran Zhang et al.

Graph clustering aims to divide the graph into different clusters. The recently emerging deep graph clustering approaches are largely built on graph neural networks (GNN). However, GNN is designed for general graph encoding and there is a common issue of representation collapse in existing GNN-based deep graph clustering algorithms. We attribute two main reasons for such issues: (i) the inductive bias of GNN models: GNNs tend to generate similar representations for proximal nodes. Since graphs often contain a non-negligible amount of inter-cluster links, the bias results in error message passing and leads to biased clustering; (ii) the clustering guided loss function: most traditional approaches strive to make all samples closer to pre-learned cluster centers, which causes a degenerate solution assigning all data points to a single label thus making all samples similar and less discriminative. To address these challenges, we investigate graph clustering from a graph cut perspective and propose an innovative and non-GNN-based Deep Cut-informed Graph embedding and Clustering framework, namely DCGC. This framework includes two modules: (i) cut-informed graph encoding; (ii) self-supervised graph clustering via optimal transport. For the encoding module, we derive a cut-informed graph embedding objective to fuse graph structure and attributes by minimizing their joint normalized cut. For the clustering module, we utilize the optimal transport theory to obtain the clustering assignments, which can balance the guidance of "proximity to the pre-learned cluster center". With the above two tailored designs, DCGC is more suitable for the graph clustering task, which can effectively alleviate the problem of representation collapse and achieve better performance. We conduct extensive experiments to demonstrate that our method is simple but effective compared with benchmarks.

GNMay 19, 2025
scSiameseClu: A Siamese Clustering Framework for Interpreting single-cell RNA Sequencing Data

Ping Xu, Zhiyuan Ning, Pengjiang Li et al.

Single-cell RNA sequencing (scRNA-seq) reveals cell heterogeneity, with cell clustering playing a key role in identifying cell types and marker genes. Recent advances, especially graph neural networks (GNNs)-based methods, have significantly improved clustering performance. However, the analysis of scRNA-seq data remains challenging due to noise, sparsity, and high dimensionality. Compounding these challenges, GNNs often suffer from over-smoothing, limiting their ability to capture complex biological information. In response, we propose scSiameseClu, a novel Siamese Clustering framework for interpreting single-cell RNA-seq data, comprising of 3 key steps: (1) Dual Augmentation Module, which applies biologically informed perturbations to the gene expression matrix and cell graph relationships to enhance representation robustness; (2) Siamese Fusion Module, which combines cross-correlation refinement and adaptive information fusion to capture complex cellular relationships while mitigating over-smoothing; and (3) Optimal Transport Clustering, which utilizes Sinkhorn distance to efficiently align cluster assignments with predefined proportions while maintaining balance. Comprehensive evaluations on seven real-world datasets demonstrate that scSiameseClu outperforms state-of-the-art methods in single-cell clustering, cell type annotation, and cell type classification, providing a powerful tool for scRNA-seq data interpretation.

LGJul 14, 2025
Soft Graph Clustering for single-cell RNA Sequencing Data

Ping Xu, Pengfei Wang, Zhiyuan Ning et al.

Clustering analysis is fundamental in single-cell RNA sequencing (scRNA-seq) data analysis for elucidating cellular heterogeneity and diversity. Recent graph-based scRNA-seq clustering methods, particularly graph neural networks (GNNs), have significantly improved in tackling the challenges of high-dimension, high-sparsity, and frequent dropout events that lead to ambiguous cell population boundaries. However, their reliance on hard graph constructions derived from thresholded similarity matrices presents challenges:(i) The simplification of intercellular relationships into binary edges (0 or 1) by applying thresholds, which restricts the capture of continuous similarity features among cells and leads to significant information loss.(ii) The presence of significant inter-cluster connections within hard graphs, which can confuse GNN methods that rely heavily on graph structures, potentially causing erroneous message propagation and biased clustering outcomes. To tackle these challenges, we introduce scSGC, a Soft Graph Clustering for single-cell RNA sequencing data, which aims to more accurately characterize continuous similarities among cells through non-binary edge weights, thereby mitigating the limitations of rigid data structures. The scSGC framework comprises three core components: (i) a zero-inflated negative binomial (ZINB)-based feature autoencoder; (ii) a dual-channel cut-informed soft graph embedding module; and (iii) an optimal transport-based clustering optimization module. Extensive experiments across ten datasets demonstrate that scSGC outperforms 13 state-of-the-art clustering models in clustering accuracy, cell type annotation, and computational efficiency. These results highlight its substantial potential to advance scRNA-seq data analysis and deepen our understanding of cellular heterogeneity.

OTFeb 21, 2025
Strategic priorities for transformative progress in advancing biology with proteomics and artificial intelligence

Yingying Sun, Jun A, Zhiwei Liu et al.

Artificial intelligence (AI) is transforming scientific research, including proteomics. Advances in mass spectrometry (MS)-based proteomics data quality, diversity, and scale, combined with groundbreaking AI techniques, are unlocking new challenges and opportunities in biological discovery. Here, we highlight key areas where AI is driving innovation, from data analysis to new biological insights. These include developing an AI-friendly ecosystem for proteomics data generation, sharing, and analysis; improving peptide and protein identification and quantification; characterizing protein-protein interactions and protein complexes; advancing spatial and perturbation proteomics; integrating multi-omics data; and ultimately enabling AI-empowered virtual cells.

GNSep 30, 2025
scUnified: An AI-Ready Standardized Resource for Single-Cell RNA Sequencing Analysis

Ping Xu, Zaitian Wang, Zhirui Wang et al.

Single-cell RNA sequencing (scRNA-seq) technology enables systematic delineation of cellular states and interactions, providing crucial insights into cellular heterogeneity. Building on this potential, numerous computational methods have been developed for tasks such as cell clustering, cell type annotation, and marker gene identification. To fully assess and compare these methods, standardized, analysis-ready datasets are essential. However, such datasets remain scarce, and variations in data formats, preprocessing workflows, and annotation strategies hinder reproducibility and complicate systematic evaluation of existing methods. To address these challenges, we present scUnified, an AI-ready standardized resource for single-cell RNA sequencing data that consolidates 13 high-quality datasets spanning two species (human and mouse) and nine tissue types. All datasets undergo standardized quality control and preprocessing and are stored in a uniform format to enable direct application in diverse computational analyses without additional data cleaning. We further demonstrate the utility of scUnified through experimental analyses of representative biological tasks, providing a reproducible foundation for the standardized evaluation of computational methods on a unified dataset.

LGMar 5, 2025
Towards Trustworthy Federated Learning

Alina Basharat, Yijun Bian, Ping Xu et al.

This paper develops a comprehensive framework to address three critical trustworthy challenges in federated learning (FL): robustness against Byzantine attacks, fairness, and privacy preservation. To improve the system's defense against Byzantine attacks that send malicious information to bias the system's performance, we develop a Two-sided Norm Based Screening (TNBS) mechanism, which allows the central server to crop the gradients that have the l lowest norms and h highest norms. TNBS functions as a screening tool to filter out potential malicious participants whose gradients are far from the honest ones. To promote egalitarian fairness, we adopt the q-fair federated learning (q-FFL). Furthermore, we adopt a differential privacy-based scheme to prevent raw data at local clients from being inferred by curious parties. Convergence guarantees are provided for the proposed framework under different scenarios. Experimental results on real datasets demonstrate that the proposed framework effectively improves robustness and fairness while managing the trade-off between privacy and accuracy. This work appears to be the first study that experimentally and theoretically addresses fairness, privacy, and robustness in trustworthy FL.

LGJan 28, 2020
COKE: Communication-Censored Decentralized Kernel Learning

Ping Xu, Yue Wang, Xiang Chen et al.

This paper studies the decentralized optimization and learning problem where multiple interconnected agents aim to learn an optimal decision function defined over a reproducing kernel Hilbert space by jointly minimizing a global objective function, with access to their own locally observed dataset. As a non-parametric approach, kernel learning faces a major challenge in distributed implementation: the decision variables of local objective functions are data-dependent and thus cannot be optimized under the decentralized consensus framework without any raw data exchange among agents. To circumvent this major challenge, we leverage the random feature (RF) approximation approach to enable consensus on the function modeled in the RF space by data-independent parameters across different agents. We then design an iterative algorithm, termed DKLA, for fast-convergent implementation via ADMM. Based on DKLA, we further develop a communication-censored kernel learning (COKE) algorithm that reduces the communication load of DKLA by preventing an agent from transmitting at every iteration unless its local updates are deemed informative. Theoretical results in terms of linear convergence guarantee and generalization performance analysis of DKLA and COKE are provided. Comprehensive tests on both synthetic and real datasets are conducted to verify the communication efficiency and learning effectiveness of COKE.

QUANT-PHDec 16, 2019
Variational Quantum Circuits for Quantum State Tomography

Yong Liu, Dongyang Wang, Shichuan Xue et al.

Quantum state tomography is a key process in most quantum experiments. In this work, we employ quantum machine learning for state tomography. Given an unknown quantum state, it can be learned by maximizing the fidelity between the output of a variational quantum circuit and this state. The number of parameters of the variational quantum circuit grows linearly with the number of qubits and the circuit depth, so that only polynomial measurements are required, even for highly-entangled states. After that, a subsequent classical circuit simulator is used to transform the information of the target quantum state from the variational quantum circuit into a familiar format. We demonstrate our method by performing numerical simulations for the tomography of the ground state of a one-dimensional quantum spin chain, using a variational quantum circuit simulator. Our method is suitable for near-term quantum computing platforms, and could be used for relatively large-scale quantum state tomography for experimentally relevant quantum states.