Xia Zhu

LG
h-index: 10
14 papers
1,145 citations
Novelty: 45%
AI Score: 47

14 Papers

AR · Mar 28 · Code
ENEC: A Lossless AI Model Compression Method Enabling Fast Inference on Ascend NPUs

Jinwu Yang, Jiaan Wu, Zedong Liu et al.

The rapid scaling of Large Language Models presents significant challenges for their deployment and inference, particularly on resource-constrained specialized AI hardware accelerators such as Huawei's Ascend NPUs, where weight data transfer has become a critical performance bottleneck. While lossless compression can preserve model accuracy and reduce data volume, existing lossless compression algorithms exhibit extremely low throughput when ported to the Ascend NPU architecture. In this paper, we propose ENEC, a novel lossless compression method specifically customized for AI model weights and optimized for Ascend Neural Processing Units. ENEC adopts a block-based fixed-length encoding scheme and incorporates a series of NPU-specific optimizations: bit-width quantization with hierarchical halving bit-packing, vectorized branch-free integer transformation, and dependency-decoupled intra-segment scan for efficient prefix-sum computation. Experimental results demonstrate that ENEC outperforms existing state-of-the-art NPU compressors in both compression ratio and throughput. Compared to leading GPU solutions, ENEC achieves a 3.43X higher throughput than DietGPU and a 1.12X better compression ratio than nvCOMP. By reducing weight transmission overhead, ENEC significantly improves end-to-end inference performance, achieving up to a 6.3X speedup. On Ascend NPUs, ENEC is the first open-source lossless compression algorithm for model weights that achieves performance comparable to state-of-the-art GPU compressors, offering an effective solution for deploying large-scale AI models.
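
As a rough illustration of block-based fixed-length encoding, the Python sketch below packs each block of non-negative integers using the smallest bit width that fits the block's largest value, then uses a prefix sum over per-block sizes to find where every block starts. The block size, the function names (`pack_blocks`, `block_offsets`, `unpack_blocks`), and the bit-string layout are assumptions made for illustration; they are not taken from ENEC, whose NPU-specific optimizations are not shown here.

```python
from itertools import accumulate

def pack_blocks(values, block_size=4):
    """Pack non-negative ints block by block with one fixed bit width per block."""
    widths, bits = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        width = max(max(block).bit_length(), 1)   # bits needed for the largest value
        widths.append(width)
        for v in block:
            bits.append(format(v, f"0{width}b"))  # fixed-length code inside the block
    return widths, "".join(bits)

def block_offsets(widths, n_values, block_size=4):
    """Exclusive prefix sum of block sizes (in bits): start offset of each block."""
    sizes = []
    for i, w in enumerate(widths):
        count = min(block_size, n_values - i * block_size)
        sizes.append(w * count)
    return [0] + list(accumulate(sizes))[:-1]

def unpack_blocks(widths, bitstring, n_values, block_size=4):
    """Decode every block independently once its offset is known."""
    offsets = block_offsets(widths, n_values, block_size)
    out = []
    for i, (w, off) in enumerate(zip(widths, offsets)):
        count = min(block_size, n_values - i * block_size)
        for j in range(count):
            out.append(int(bitstring[off + j * w: off + (j + 1) * w], 2))
    return out
```

Because the offsets come from a prefix sum, each block can be decoded without first decoding the blocks before it, which is what makes this kind of layout attractive for parallel hardware.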

LG · Mar 2, 2023
Interpretable System Identification and Long-term Prediction on Time-Series Data

Xiaoyi Liu, Duxin Chen, Wenjia Wei et al.

Time-series prediction has drawn considerable attention during the past decades, fueled by the emerging advances of deep learning methods. However, most neural-network-based methods lack interpretability and fail to extract the hidden mechanism of the targeted physical system. To overcome these shortcomings, an interpretable sparse system identification method that requires no prior knowledge is proposed in this study. This method adopts the Fourier transform to reduce the irrelevant items in the dictionary matrix, instead of the indiscriminate use of polynomial functions found in most system identification methods. It yields an interpretable system representation and greatly reduces computing cost. With the adoption of the $l_1$ norm in regularizing the parameter matrix, a sparse description of the system model can be achieved. Moreover, three data sets, including water conservancy data, global temperature data, and financial data, are used to test the performance of the proposed method. Although no prior knowledge of the physical background was available, experimental results show that our method can achieve long-term prediction, regardless of the noise and incompleteness in the original data, more accurately than the widely used baseline data-driven methods. This study may provide some insight into time-series prediction investigations, and suggests that a white-box system identification method may extract the easily overlooked yet inherent periodical features and may beat neural-network-based black-box methods on long-term prediction tasks.
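
The idea of using a Fourier transform to prune the dictionary can be sketched as follows: compute the spectrum of the series, keep only the strongest frequencies, and build sine and cosine columns for those frequencies alone. This is a minimal sketch under assumptions; the helper names, the naive DFT, and the `keep` parameter are hypothetical and not from the paper, and the subsequent $l_1$-regularized fit of the parameter matrix is not shown.

```python
import cmath
import math

def dft_magnitudes(signal):
    """Magnitudes of the DFT for frequencies 1 .. N//2 (naive O(N^2) version)."""
    n = len(signal)
    mags = {}
    for k in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(signal))
        mags[k] = abs(coeff)
    return mags

def select_frequencies(signal, keep=2):
    """Indices of the `keep` strongest frequencies, in ascending order."""
    mags = dft_magnitudes(signal)
    return sorted(sorted(mags, key=mags.get, reverse=True)[:keep])

def build_dictionary(n, freqs):
    """Dictionary rows: a constant term, then a sin/cos pair per kept frequency."""
    rows = []
    for t in range(n):
        row = [1.0]
        for k in freqs:
            angle = 2 * math.pi * k * t / n
            row += [math.sin(angle), math.cos(angle)]
        rows.append(row)
    return rows
```

With two kept frequencies the dictionary has five columns per time step, instead of the many polynomial terms a generic library would contain.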

LG · Nov 18, 2025 · Code
CLO: Efficient LLM Inference System with CPU-Light KVCache Offloading via Algorithm-System Co-Design

Jiawei Yi, Ping Gong, Youhui Bai et al.

The growth of million-token LLMs exposes the scalability limits of inference systems, where the KVCache dominates memory usage and data transfer overhead. Recent offloading systems migrate the KVCache to CPU memory and incorporate top-k attention to reduce the volume of data transferred from the CPU, while further applying system-level optimizations such as on-GPU caching and prefetching to lower transfer overhead. However, they overlook the CPU bottleneck in three aspects: (1) substantial overhead of fine-grained dynamic cache management performed on the CPU side, (2) significant transfer overhead from poor PCIe bandwidth utilization caused by heavy gathering operations at the CPU side, and (3) GPU runtime bubbles introduced by coarse-grained CPU-centric synchronization. To address these challenges, we propose CLO, a CPU-light KVCache offloading system via algorithm-system co-design. CLO features: (1) a coarse-grained head-wise approximate on-GPU caching strategy with negligible cache management cost, (2) seamless combination of data prefetching and on-GPU persistent caching for lower transfer overhead, (3) a zero-copy transfer engine to fully exploit PCIe bandwidth, and a GPU-centric synchronization method to eliminate GPU stalls. Evaluation on two widely-used LLMs demonstrates that CLO achieves comparable accuracy to state-of-the-art systems, while substantially minimizing CPU overhead, fully utilizing PCIe bandwidth, thus improving decoding throughput by 9.3%-66.6%. Our results highlight that algorithm-system co-design is essential for memory-constrained LLM inference on modern GPU platforms. We open source CLO at https://github.com/CommediaJW/CLO.
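
To make the role of top-k attention concrete, the sketch below scores every cached key against the query but attends over only the k highest-scoring entries, so only those key/value pairs would need to be fetched. It is a generic illustration, not CLO's implementation: the function name `topk_attention` is hypothetical, the usual scaling by the square root of the head dimension is omitted, and none of the caching, prefetching, or transfer logic is modeled.

```python
import math

def topk_attention(query, keys, values, k):
    """Softmax attention restricted to the k keys with the largest dot products."""
    scores = [sum(q * x for q, x in zip(query, key)) for key in keys]
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    peak = max(scores[i] for i in top)                  # subtract max for stability
    weights = [math.exp(scores[i] - peak) for i in top]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * values[i][d] for w, i in zip(weights, top)) / total
           for d in range(dim)]
    return sorted(top), out                             # selected indices, output
```

The returned index list stands in for the set of KVCache entries that an offloading system would have to move from CPU to GPU memory for this decoding step.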

LG · Mar 13, 2021 · Code
Spectral Temporal Graph Neural Network for Multivariate Time-series Forecasting

Defu Cao, Yujing Wang, Juanyong Duan et al.

Multivariate time-series forecasting plays a crucial role in many real-world applications. It is a challenging problem, as one needs to consider both intra-series temporal correlations and inter-series correlations simultaneously. Recently, there have been multiple works trying to capture both correlations, but most, if not all, of them capture temporal correlations only in the time domain and resort to pre-defined priors as inter-series relationships. In this paper, we propose Spectral Temporal Graph Neural Network (StemGNN) to further improve the accuracy of multivariate time-series forecasting. StemGNN captures inter-series correlations and temporal dependencies jointly in the spectral domain. It combines the Graph Fourier Transform (GFT), which models inter-series correlations, and the Discrete Fourier Transform (DFT), which models temporal dependencies, in an end-to-end framework. After passing through GFT and DFT, the spectral representations hold clear patterns and can be predicted effectively by convolution and sequential learning modules. Moreover, StemGNN learns inter-series correlations automatically from the data without using pre-defined priors. We conduct extensive experiments on ten real-world datasets to demonstrate the effectiveness of StemGNN. Code is available at https://github.com/microsoft/StemGNN/

LG · Feb 17, 2024
Maintaining Adversarial Robustness in Continuous Learning

Xiaolei Ru, Xiaowei Cao, Zijia Liu et al.

Adversarial robustness is essential for the security and reliability of machine learning systems. However, adversarial robustness enhanced by defense algorithms is easily erased as the neural network's weights update to learn new tasks. To address this vulnerability, it is essential to improve the capability of neural networks in terms of robust continual learning. Specifically, we propose a novel gradient projection technique that effectively stabilizes sample gradients from previous data by orthogonally projecting back-propagation gradients onto a crucial subspace before using them for weight updates. This technique can maintain robustness by collaborating with a class of defense algorithms through sample gradient smoothing. The experimental results on four benchmarks, including Split-CIFAR100 and Split-miniImageNet, demonstrate the superiority of the proposed approach in mitigating the rapid degradation of robustness during continual learning, even when facing strong adversarial attacks.
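
The geometric operation behind gradient projection can be sketched in a few lines: build an orthonormal basis for a subspace, then split a gradient into the part lying inside that subspace and the part orthogonal to it. This is a generic linear-algebra sketch under assumptions; how the paper constructs its subspace and which component it feeds into the weight update are not specified here, and the helper names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def orthonormalize(vectors, tol=1e-9):
    """Gram-Schmidt: orthonormal basis of the span, dropping dependent vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for u in basis:
            coeff = dot(w, u)
            w = [wi - coeff * ui for wi, ui in zip(w, u)]
        norm = dot(w, w) ** 0.5
        if norm > tol:
            basis.append([wi / norm for wi in w])
    return basis

def split_gradient(grad, basis):
    """Return (component inside the subspace, component orthogonal to it)."""
    inside = [0.0] * len(grad)
    for u in basis:
        coeff = dot(grad, u)
        inside = [p + coeff * ui for p, ui in zip(inside, u)]
    outside = [g - p for g, p in zip(grad, inside)]
    return inside, outside
```

The two components always sum back to the original gradient, so choosing one of them for the update is what restricts learning to, or away from, the chosen subspace.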

CV · Jun 4, 2020
MFPP: Morphological Fragmental Perturbation Pyramid for Black-Box Model Explanations

Qing Yang, Xia Zhu, Jong-Kae Fwu et al.

Deep neural networks (DNNs) have recently been applied to many advanced and diverse tasks, such as medical diagnosis and automatic driving. Due to the lack of transparency of deep models, DNNs are often criticized because their predictions cannot be explained by humans. In this paper, we propose a novel Morphological Fragmental Perturbation Pyramid (MFPP) method to solve the Explainable AI problem. In particular, we focus on the black-box scheme, which can identify the input area that is responsible for the output of the DNN without having to understand the internal architecture of the DNN. In the MFPP method, we divide the input image into multi-scale fragments and randomly mask out fragments as perturbation to generate a saliency map, which indicates the significance of each pixel for the prediction result of the black-box model. Compared with the existing input sampling perturbation method, the pyramid-structured fragments prove to be more effective. The method can better explore the morphological information of the input image to match its semantic information, and it does not need any value inside the DNN. We show qualitatively and quantitatively that MFPP meets or exceeds the performance of state-of-the-art (SOTA) black-box interpretation methods on multiple DNN models and datasets.
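
The perturbation idea can be sketched as: split the image into fragments at several scales, randomly keep or mask whole fragments, query the black-box model on each masked image, and credit the model's score to every pixel that was left visible. The sketch below is a simplification under stated assumptions: square grid cells stand in for the morphological fragments, masked pixels are set to zero, and `saliency_map`, `grid_fragments`, and all parameters are hypothetical names rather than the paper's.

```python
import random

def grid_fragments(height, width, cell):
    """Fragment id per pixel using square cells (a stand-in for real fragments)."""
    cols = (width + cell - 1) // cell
    return [[(r // cell) * cols + (c // cell) for c in range(width)]
            for r in range(height)]

def saliency_map(image, model, cells=(1, 2), n_masks=200, keep_prob=0.5, seed=0):
    """Accumulate the model score over the pixels kept visible in each mask."""
    rng = random.Random(seed)
    h, w = len(image), len(image[0])
    total = [[0.0] * w for _ in range(h)]
    for cell in cells:                        # pyramid: one fragment layout per scale
        frag = grid_fragments(h, w, cell)
        n_frag = max(max(row) for row in frag) + 1
        for _ in range(n_masks):
            keep = [rng.random() < keep_prob for _ in range(n_frag)]
            masked = [[image[r][c] if keep[frag[r][c]] else 0.0
                       for c in range(w)] for r in range(h)]
            score = model(masked)             # only the output is used (black box)
            for r in range(h):
                for c in range(w):
                    if keep[frag[r][c]]:
                        total[r][c] += score
    return total
```

For a toy model such as `lambda img: img[1][2]` on an all-ones image, the accumulated map peaks at the pixel the model actually reads, since the score is nonzero only when that pixel is visible.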

AP · May 11, 2020
Incorporating structured assumptions with probabilistic graphical models in fMRI data analysis

Ming Bo Cai, Michael Shvartsman, Anqi Wu et al.

With the wide adoption of functional magnetic resonance imaging (fMRI) by cognitive neuroscience researchers, large volumes of brain imaging data have been accumulated in recent years. Aggregating these data to derive scientific insights often faces the challenge that fMRI data are high-dimensional, heterogeneous across people, and noisy. These challenges demand the development of computational tools that are tailored both for the neuroscience questions and for the properties of the data. We review a few recently developed algorithms in various domains of fMRI research: fMRI in naturalistic tasks, analyzing full-brain functional connectivity, pattern classification, inferring representational similarity and modeling structured residuals. These algorithms all tackle the challenges in fMRI similarly: they start by making clear statements of assumptions about neural data and existing domain knowledge, incorporating those assumptions and domain knowledge into probabilistic graphical models, and using those models to estimate properties of interest or latent structures in the data. Such approaches can avoid erroneous findings, reduce the impact of noise, better utilize known properties of the data, and better aggregate data across groups of subjects. With these successful cases, we advocate wider adoption of explicit model construction in cognitive neuroscience. Although we focus on fMRI, the principle illustrated here is generally applicable to brain data of other modalities.

CV · Apr 24, 2020
PipeNet: Selective Modal Pipeline of Fusion Network for Multi-Modal Face Anti-Spoofing

Qing Yang, Xia Zhu, Jong-Kae Fwu et al.

Face anti-spoofing has become an increasingly important security feature for authentication systems, due to rampant and easily launched presentation attacks. Addressing the shortage of multi-modal face datasets, CASIA recently released the largest up-to-date CASIA-SURF Cross-ethnicity Face Anti-spoofing (CeFA) dataset, covering 3 ethnicities, 3 modalities, 1607 subjects, and 2D plus 3D attack types in four protocols, and focusing on the challenge of improving the generalization capability of face anti-spoofing in cross-ethnicity and multi-modal continuous data. In this paper, we propose a novel pipeline-based multi-stream CNN architecture called PipeNet for multi-modal face anti-spoofing. Unlike previous works, Selective Modal Pipeline (SMP) is designed to enable a customized pipeline for each data modality to take full advantage of multi-modal data. Limited Frame Vote (LFV) is designed to ensure stable and accurate prediction for video classification. The proposed method won third place in the final ranking of the Chalearn Multi-modal Cross-ethnicity Face Anti-spoofing Recognition Challenge@CVPR2020. Our final submission achieves an Average Classification Error Rate (ACER) of 2.21 with a standard deviation of 1.26 on the test set.

CV · Mar 10, 2020
Channel Pruning via Optimal Thresholding

Yun Ye, Ganmei You, Jong-Kae Fwu et al.

Structured pruning, especially channel pruning, is widely used for its reduced computational cost and its compatibility with off-the-shelf hardware devices. Among existing works, weights are typically removed using a predefined global threshold, or a threshold computed from a predefined metric. Designs based on a predefined global threshold ignore the variation among different layers and weight distributions, so they may often result in sub-optimal performance caused by over-pruning or under-pruning. In this paper, we present a simple yet effective method, termed Optimal Thresholding (OT), to prune channels with layer-dependent thresholds that optimally separate important from negligible channels. By using OT, most negligible or unimportant channels are pruned to achieve high sparsity while minimizing performance degradation. Since most important weights are preserved, the pruned model can be further fine-tuned and quickly converge with very few iterations. Our method demonstrates superior performance, especially when compared to the state-of-the-art designs at high levels of sparsity. On CIFAR-100, a DenseNet-121 pruned with OT and fine-tuned achieves 75.99% accuracy with only 1.46e8 FLOPs and 0.71M parameters.

LG · Sep 4, 2018
Out-of-Distribution Detection Using an Ensemble of Self Supervised Leave-out Classifiers

Apoorv Vyas, Nataraj Jammalamadaka, Xia Zhu et al.

As deep learning methods form a critical part of commercially important applications such as autonomous driving and medical diagnostics, it is important to reliably detect out-of-distribution (OOD) inputs while employing these algorithms. In this work, we propose an OOD detection algorithm which comprises an ensemble of classifiers. We train each classifier in a self-supervised manner by leaving out a random subset of training data as OOD data and the rest as in-distribution (ID) data. We propose a novel margin-based loss over the softmax output which seeks to maintain at least a margin $m$ between the average entropy of the OOD and in-distribution samples. In conjunction with the standard cross-entropy loss, we minimize the novel loss to train an ensemble of classifiers. We also propose a novel method to combine the outputs of the ensemble of classifiers to obtain an OOD detection score and class prediction. Overall, our method convincingly outperforms Hendrycks et al.[7] and the current state-of-the-art ODIN[13] on several OOD detection benchmarks.
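
One plausible reading of the margin-based loss described above is a hinge on the gap between average entropies: the loss is zero once the mean entropy of OOD softmax outputs exceeds that of ID outputs by at least $m$. The sketch below implements that reading; the exact form used in the paper may differ, and the function names are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (natural log) of one softmax output."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def margin_entropy_loss(id_probs, ood_probs, margin):
    """Hinge loss: zero once mean OOD entropy exceeds mean ID entropy by `margin`."""
    mean_id = sum(entropy(p) for p in id_probs) / len(id_probs)
    mean_ood = sum(entropy(p) for p in ood_probs) / len(ood_probs)
    return max(0.0, margin + mean_id - mean_ood)
```

Minimizing this term pushes ID predictions toward confident (low-entropy) outputs and OOD predictions toward uniform (high-entropy) outputs, but only until the margin is satisfied.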

CV · Nov 17, 2017
Segmenting Brain Tumors with Symmetry

Hejia Zhang, Xia Zhu, Theodore L. Willke

We explore encoding brain symmetry into a neural network for a brain tumor segmentation task. A healthy human brain is symmetric at a high level of abstraction, and the high-level asymmetric parts are more likely to be tumor regions. Paying more attention to asymmetries has the potential to boost the performance in brain tumor segmentation. We propose a method to encode brain symmetry into existing neural networks and apply the method to a state-of-the-art neural network for medical imaging segmentation. We evaluate our symmetry-encoded network on the dataset from a brain tumor segmentation challenge and verify that the new model extracts information in the training images more efficiently than the original model.

ML · Sep 29, 2016
A Searchlight Factor Model Approach for Locating Shared Information in Multi-Subject fMRI Analysis

Hejia Zhang, Po-Hsuan Chen, Janice Chen et al.

There is a growing interest in joint multi-subject fMRI analysis. The challenge of such analysis comes from inherent anatomical and functional variability across subjects. One approach to resolving this is a shared response factor model. This assumes a shared and time synchronized stimulus across subjects. Such a model can often identify shared information, but it may not be able to pinpoint with high resolution the spatial location of this information. In this work, we examine a searchlight based shared response model to identify shared information in small contiguous regions (searchlights) across the whole brain. Validation using classification tasks demonstrates that we can pinpoint informative local regions.

ML · Aug 17, 2016
A Convolutional Autoencoder for Multi-Subject fMRI Data Aggregation

Po-Hsuan Chen, Xia Zhu, Hejia Zhang et al.

Finding the most effective way to aggregate multi-subject fMRI data is a long-standing and challenging problem. It is of increasing interest in contemporary fMRI studies of human cognition due to the scarcity of data per subject and the variability of brain anatomy and functional response across subjects. Recent work on latent factor models shows promising results in this task but this approach does not preserve spatial locality in the brain. We examine two ways to combine the ideas of a factor model and a searchlight based analysis to aggregate multi-subject fMRI data while preserving spatial locality. We first do this directly by combining a recent factor method known as a shared response model with searchlight analysis. Then we design a multi-view convolutional autoencoder for the same task. Both approaches preserve spatial locality and have competitive or better performance compared with standard searchlight analysis and the shared response model applied across the whole brain. We also report a system design to handle the computational challenge of training the convolutional autoencoder.

ML · Aug 16, 2016
Enabling Factor Analysis on Thousand-Subject Neuroimaging Datasets

Michael J. Anderson, Mihai Capotă, Javier S. Turek et al.

The scale of functional magnetic resonance imaging data is rapidly increasing as large multi-subject datasets become widely available and high-resolution scanners are adopted. The inherent low dimensionality of the information in these data has led neuroscientists to consider factor analysis methods to extract and analyze the underlying brain activity. In this work, we consider two recent multi-subject factor analysis methods: the Shared Response Model and Hierarchical Topographic Factor Analysis. We perform analytical, algorithmic, and code optimization to enable multi-node parallel implementations to scale. Single-node improvements result in 99x and 1812x speedups on these two methods, and enable the processing of larger datasets. Our distributed implementations show strong scaling of 3.3x and 5.5x respectively with 20 nodes on real datasets. We also demonstrate weak scaling on a synthetic dataset with 1024 subjects, on up to 1024 nodes and 32,768 cores.