Sheng Di

LG
h-index34
24papers
309citations
Novelty56%
AI Score57

24 Papers

68.1DCJun 1
Not All Errors Are Equal: A Systematic Study of Error Propagation in Large Language Model Inference

Yafan Huang, Sheng Di, Guanpeng Li

Large language models (LLMs) are increasingly integrated into high-performance computing (HPC) workflows, accelerating scientific discovery through diverse perspectives such as code generation and domain-specific decision-making. Yet, how soft errors propagate and affect LLM inference remains largely unexplored. To bridge this gap, we present a comprehensive study on error propagation in LLM inference, enabled by our proposed LLMFI, a configurable and deterministic fault-injection framework. Using LLMFI, we systematically inject faults across three open-weighted LLMs and thirteen representative tasks, covering reasoning, multilingual, mathematical, and coding domains. In addition, we conduct fine-grained case studies that reveal critical vulnerability patterns. Overall, our study yields 17 takeaways that advance the understanding of error propagation in LLM inference and introduces four low-overhead directions to improve reliability through software-only modification, offering practical guidance for future error detection and mitigation.

85.8LGJun 1
Faster Synchronous On-Policy RL via Straggler-Aware Group Sizing

Azal Ahmad Khan, Ammar Ahmed, Zeshan Fayyaz et al.

Synchronous reinforcement learning methods such as Group Relative Policy Optimization (GRPO) provide stable and reproducible on-policy training, but they are highly vulnerable to stragglers, a single unusually long rollout can delay reward computation and parameter updates for the entire group. This problem becomes more severe as group size increases, creating a tension between the benefits of larger groups and the wall-clock cost of synchronization stalls. We propose Straggler-Aware Group Control (SAGC), a dynamic group-size controller that adapts the training group online based on observed rollout behavior. SAGC formulates group-size selection as an online constrained optimization problem, seeking to retain the benefits of larger groups while controlling the long-term rate of straggler events. Across synchronous GRPO and DAPO training, and on top of both vanilla and strong engineered baselines, SAGC consistently reduces straggler incidence and improves wall-clock efficiency while achieving competitive or better training reward. We further show that these gains transfer to final model quality: SAGC is competitive with or better than the strongest static group-size baseline on downstream reasoning benchmarks, and often produces shorter outputs without any explicit length penalty. These results position dynamic group control as a practical way to make synchronous on-policy RL more efficient and robust.

LGSep 7, 2023
SRN-SZ: Deep Leaning-Based Scientific Error-bounded Lossy Compression with Super-resolution Neural Networks

Jinyang Liu, Sheng Di, Sian Jin et al.

The fast growth of computational power and scales of modern super-computing systems have raised great challenges for the management of exascale scientific data. To maintain the usability of scientific data, error-bound lossy compression is proposed and developed as an essential technique for the size reduction of scientific data with constrained data distortion. Among the diverse datasets generated by various scientific simulations, certain datasets cannot be effectively compressed by existing error-bounded lossy compressors with traditional techniques. The recent success of Artificial Intelligence has inspired several researchers to integrate neural networks into error-bounded lossy compressors. However, those works still suffer from limited compression ratios and/or extremely low efficiencies. To address those issues and improve the compression on the hard-to-compress datasets, in this paper, we propose SRN-SZ, which is a deep learning-based scientific error-bounded lossy compressor leveraging the hierarchical data grid expansion paradigm implemented by super-resolution neural networks. SRN-SZ applies the most advanced super-resolution network HAT for its compression, which is free of time-costing per-data training. In experiments compared with various state-of-the-art compressors, SRN-SZ achieves up to 75% compression ratio improvements under the same error bound and up to 80% compression ratio improvements under the same PSNR than the second-best compressor.

DCAug 2, 2024
FT K-means: A High-Performance K-means on GPU with Fault Tolerance

Shixun Wu, Yitong Ding, Yujia Zhai et al.

K-means is a widely used algorithm in clustering, however, its efficiency is primarily constrained by the computational cost of distance computing. Existing implementations suffer from suboptimal utilization of computational units and lack resilience against soft errors. To address these challenges, we introduce FT K-means, a high-performance GPU-accelerated implementation of K-means with online fault tolerance. We first present a stepwise optimization strategy that achieves competitive performance compared to NVIDIA's cuML library. We further improve FT K-means with a template-based code generation framework that supports different data types and adapts to different input shapes. A novel warp-level tensor-core error correction scheme is proposed to address the failure of existing fault tolerance methods due to memory asynchronization during copy operations. Our experimental evaluations on NVIDIA T4 GPU and A100 GPU demonstrate that FT K-means without fault tolerance outperforms cuML's K-means implementation, showing a performance increase of 10\%-300\% in scenarios involving irregular data shapes. Moreover, the fault tolerance feature of FT K-means introduces only an overhead of 11\%, maintaining robust performance even with tens of errors injected per second.

14.4LGApr 20
Preserving Clusters in Error-Bounded Lossy Compression of Particle Data

Congrong Ren, Sheng Di, Katrin Heitmann et al.

Lossy compression is widely used to reduce storage and I/O costs for large-scale particle datasets in scientific applications such as cosmology, molecular dynamics, and fluid dynamics, where clustering structures (e.g., single-linkage or Friends-of-Friends) are critical for downstream analysis; however, existing compressors typically provide only pointwise error bounds on particle positions and offer no guarantees on preserving clustering outcomes, and even small perturbations can alter cluster connectivity and compromise scientific validity. We propose a correction-based technique to preserve single-linkage clustering under lossy compression, operating on decompressed data from off-the-shelf compressors such as SZ3 and Draco. Our key contributions are threefold: (1) a clustering-aware correction algorithm that identifies vulnerable particle pairs via spatial partitioning and local neighborhood search; (2) an optimization-based formulation that enforces clustering consistency using projected gradient descent with a loss that encodes pairwise distance violations; and (3) a scalable GPU-accelerated and distributed implementation for large-scale datasets. Experiments on cosmology and molecular dynamics datasets show that our method effectively preserves clustering results while maintaining competitive compression performance compared with SZ3, ZFP, Draco, LCP, and space-filling-curve-based schemes.

59.2DBMar 26
Enabling Homomorphic Analytical Operations on Compressed Scientific Data with Multi-stage Decompression

Xuan Wu, Sheng Di, Tripti Agarwal et al.

Error-controlled lossy compressors have been widely used in scientific applications to reduce the unprecedented size of scientific data while keeping data distortion within a user-specified threshold. While they significantly mitigate the pressure for data storage and transmission, they prolong the time to access the data because decompression is required to transform the binary compressed data into meaningful floating-point numbers. This incurs noticeable overhead for common analytical operations on scientific data that extract or derive useful information, because the time cost of the operations could be much lower than that of decompression. In this work, we design an error-controlled lossy compression and analytical framework that features multi-stage decompression and homomorphic analytical operation algorithms on intermediate decompressed data for reduced data access latency. Our contributions are threefold. (1) We abstract a generic compression pipeline with partial decompression to multiple intermediate data representations and implement four instances based on state-of-the-art high-throughput scientific data compressors. (2) We carefully design homomorphic algorithms to enable direct operations on intermediate decompressed data for three types of analytical operations on scientific data. (3) We evaluate our approach using five real-world scientific datasets. Experimental evaluations demonstrate that our method achieves significant speedups when performing analytical operations on compressed scientific data across all three targeted analytical operation types.

90.8ITMar 10
Rate-Distortion Bounds for Heterogeneous Random Fields on Finite Lattices

Sujata Sinha, Vishwas Rao, Robert Underwood et al.

Since Shannon's foundational work, rate-distortion theory has defined the fundamental limits of lossy compression. Classical results, derived for memoryless and stationary ergodic sources in the asymptotic regime, have shaped both transform and predictive coding architectures, as well as practical standards such as JPEG. Finite-blocklength refinements, initiated by the non-asymptotic achievability and converse bounds of Kostina and Verdu, provide precise characterizations under excess-distortion probability constraints, but primarily for memoryless or statistically homogeneous models. In contrast, error-bounded practical lossy compressors for scientific computing, such as SZ, ZFP, MGARD, and SPERR, are designed for finite, high-dimensional, spatially correlated, and statistically heterogeneous random fields. These compressors partition data into fixed-size tiles that are processed independently, making tile size a central architectural constraint. Structural heterogeneity, finite lattice effects, and tiling constraints are not addressed by existing finite-blocklength analyses. This paper introduces a finite-blocklength rate-distortion framework for heterogeneous random fields on finite lattices, explicitly accounting for the tile-based architectures used in high-performance scientific compressors. The field is modeled as piecewise homogeneous with regionwise stationary second-order statistics, and tiling constraints are incorporated directly into the source model. Under an excess-distortion probability criterion, we establish non-asymptotic achievability, converse bounds and derive a second-order expansion that quantifies the impact of spatial correlation, region geometry, heterogeneity, and tile size on the rate and dispersion.

CVFeb 11
HII-DPO: Eliminate Hallucination via Accurate Hallucination-Inducing Counterfactual Images

Yilin Yang, Zhenghui Guo, Yuke Wang et al.

Large Vision-Language Models (VLMs) have achieved remarkable success across diverse multimodal tasks but remain vulnerable to hallucinations rooted in inherent language bias. Despite recent progress, existing hallucination mitigation methods often overlook the underlying hallucination patterns driven by language bias. In this work, we design a novel pipeline to accurately synthesize Hallucination-Inducing Images (HIIs). Using synthesized HIIs, we reveal a consistent scene-conditioned hallucination pattern: models tend to mention objects that are highly typical of the scene even when visual evidence is removed. To quantify the susceptibility of VLMs to this hallucination pattern, we establish the Masked-Object-Hallucination (MOH) benchmark to rigorously evaluate existing state-of-the-art alignment frameworks. Finally, we leverage HIIs to construct high-quality preference datasets for fine-grained alignment. Experimental results demonstrate that our approach effectively mitigates hallucinations while preserving general model capabilities. Specifically, our method achieves up to a 38% improvement over the current state-of-the-art on standard hallucination benchmarks.

DCDec 30, 2025Code
PackKV: Reducing KV Cache Memory Footprint through LLM-Aware Lossy Compression

Bo Jiang, Taolue Yang, Youyuan Liu et al.

Transformer-based large language models (LLMs) have demonstrated remarkable potential across a wide range of practical applications. However, long-context inference remains a significant challenge due to the substantial memory requirements of the key-value (KV) cache, which can scale to several gigabytes as sequence length and batch size increase. In this paper, we present \textbf{PackKV}, a generic and efficient KV cache management framework optimized for long-context generation. %, which synergistically supports both latency-critical and throughput-critical inference scenarios. PackKV introduces novel lossy compression techniques specifically tailored to the characteristics of KV cache data, featuring a careful co-design of compression algorithms and system architecture. Our approach is compatible with the dynamically growing nature of the KV cache while preserving high computational efficiency. Experimental results show that, under the same and minimum accuracy drop as state-of-the-art quantization methods, PackKV achieves, on average, \textbf{153.2}\% higher memory reduction rate for the K cache and \textbf{179.6}\% for the V cache. Furthermore, PackKV delivers extremely high execution throughput, effectively eliminating decompression overhead and accelerating the matrix-vector multiplication operation. Specifically, PackKV achieves an average throughput improvement of \textbf{75.7}\% for K and \textbf{171.7}\% for V across A100 and RTX Pro 6000 GPUs, compared to cuBLAS matrix-vector multiplication kernels, while demanding less GPU memory bandwidth. Code available on https://github.com/BoJiang03/PackKV

LGDec 24, 2025
DeepCQ: General-Purpose Deep-Surrogate Framework for Lossy Compression Quality Prediction

Khondoker Mirazul Mumenin, Robert Underwood, Dong Dai et al.

Error-bounded lossy compression techniques have become vital for scientific data management and analytics, given the ever-increasing volume of data generated by modern scientific simulations and instruments. Nevertheless, assessing data quality post-compression remains computationally expensive due to the intensive nature of metric calculations. In this work, we present a general-purpose deep-surrogate framework for lossy compression quality prediction (DeepCQ), with the following key contributions: 1) We develop a surrogate model for compression quality prediction that is generalizable to different error-bounded lossy compressors, quality metrics, and input datasets; 2) We adopt a novel two-stage design that decouples the computationally expensive feature-extraction stage from the light-weight metrics prediction, enabling efficient training and modular inference; 3) We optimize the model performance on time-evolving data using a mixture-of-experts design. Such a design enhances the robustness when predicting across simulation timesteps, especially when the training and test data exhibit significant variation. We validate the effectiveness of DeepCQ on four real-world scientific applications. Our results highlight the framework's exceptional predictive accuracy, with prediction errors generally under 10\% across most settings, significantly outperforming existing methods. Our framework empowers scientific users to make informed decisions about data compression based on their preferred data quality, thereby significantly reducing I/O and computational overhead in scientific data analysis.

90.5DCMay 15
SPARe: Stacked Parallelism with Adaptive Reordering for Fault-Tolerant LLM Pretraining Systems with 100k+ GPUs

Jin Lee, Zhonghao Chen, Xuhang He et al.

In large-scale LLM pre-training systems with 100k+ GPUs, failures become the norm rather than the exception, and restart costs can dominate wall-clock training time. However, existing fault-tolerance mechanisms are largely unprepared for this restart-dominant regime. To address this challenge, we propose SPARe - Stacked Parallelism with Adaptive Reordering - a fault-tolerance framework that masks node failures during gradient synchronization by stacking redundant data shards across parallelism groups and adaptively reordering execution. SPARe achieves availability comparable to traditional replication while maintaining near-constant computation overhead of only 2~3x, even under high redundancy where traditional replication would require linearly inflating overhead. We derive closed-form expressions for endurable failure count and computation overhead, validate them via SimGrid-based discrete-event simulation, and jointly optimize redundancy and checkpointing to minimize time-to-train. At extreme scale with up to 600k GPUs, SPARe reduces time-to-train by 40~50% compared to traditional replication.

LGFeb 26
LUMOS: Democratizing SciML Workflows with L0-Regularized Learning for Unified Feature and Parameter Adaptation

Shouwei Gao, Xu Zheng, Dongsheng Luo et al.

The rapid growth of scientific machine learning (SciML) has accelerated discovery across diverse domains, yet designing effective SciML models remains a challenging task. In practice, building such models often requires substantial prior knowledge and manual expertise, particularly in determining which input features to use and how large the model should be. We introduce LUMOS, an end-to-end framework based on L0-regularized learning that unifies feature selection and model pruning to democratize SciML model design. By employing semi-stochastic gating and reparameterization techniques, LUMOS dynamically selects informative features and prunes redundant parameters during training, reducing the reliance on manual tuning while maintaining predictive accuracy. We evaluate LUMOS across 13 diverse SciML workloads, including cosmology and molecular sciences, and demonstrate its effectiveness and generalizability. Experiments on 13 SciML models show that LUMOS achieves 71.45% parameter reduction and a 6.4x inference speedup on average. Furthermore, Distributed Data Parallel (DDP) training on up to eight GPUs confirms the scalability of

97.6DCMay 11
ReCoVer: Resilient LLM Pre-Training System via Fault-Tolerant Collective and Versatile Workload

Ziyue Liu, Zhengyang Wang, Ruijie Zhang et al.

Pre-training large language models on massive GPU clusters has made hardware faults routine rather than rare, driving the need for resilient training systems. Yet existing frameworks either focus on specific parallelism schemes or risk drifting away from a failure-free training trajectory. We propose ReCoVer, a resilient LLM pre-training system that upholds a single invariant: each iteration keeps the number of microbatches constant, ensuring per-iteration gradients remain stochastically equivalent to a failure-free run. The framework is organized as three decoupled protocol layers: (1) Fault-tolerant collectives that isolate faults from propagating across replicas; (2) in-step fine-grained recovery that preserves intra-iteration progress and prevents gradient corruption; (3) versatile-workload policy that dynamically redistributes microbatch quotas across the survivors. The design is parallelism-agnostic, integrating directly with both 3D parallelism and Hybrid Sharded Data Parallel (HSDP) as a drop-in substrate. We evaluate our implementation on end-to-end pre-training tasks for up to 512 GPUs, ReCoVer successfully preserves the training trajectory from a failure-free reference despite of 256 GPUs lost spread across the run. For comparison with checkpoint-and-restart baselines, ReCoVer demonstrates $2.23\times$ higher effective throughput after successive failures. This advantage results in ReCoVer processing 74.9% more tokens at 234 GPU-hours, with the gap widening as the training prolongs.

LGApr 17, 2024
FedFa: A Fully Asynchronous Training Paradigm for Federated Learning

Haotian Xu, Zhaorui Zhang, Sheng Di et al.

Federated learning has been identified as an efficient decentralized training paradigm for scaling the machine learning model training on a large number of devices while guaranteeing the data privacy of the trainers. FedAvg has become a foundational parameter update strategy for federated learning, which has been promising to eliminate the effect of the heterogeneous data across clients and guarantee convergence. However, the synchronization parameter update barriers for each communication round during the training significant time on waiting, slowing down the training procedure. Therefore, recent state-of-the-art solutions propose using semi-asynchronous approaches to mitigate the waiting time cost with guaranteed convergence. Nevertheless, emerging semi-asynchronous approaches are unable to eliminate the waiting time completely. We propose a full asynchronous training paradigm, called FedFa, which can guarantee model convergence and eliminate the waiting time completely for federated learning by using a few buffered results on the server for parameter updating. Further, we provide theoretical proof of the convergence rate for our proposed FedFa. Extensive experimental results indicate our approach effectively improves the training performance of federated learning by up to 6x and 4x speedup compared to the state-of-the-art synchronous and semi-asynchronous strategies while retaining high accuracy in both IID and Non-IID scenarios.

0.9DCApr 30
Exploring Sparse Matrix Multiplication Kernels on the Cerebras CS-3

Milan Shah, Sheng Di, Michela Becchi

In recent years, novel AI accelerators have emerged as promising alternatives to GPU for AI model training and inference tasks. One such accelerator, the Cerebras CS-3, achieves strong performance on large model training as well as scientific applications like molecular dynamics simulations. While dense compute workloads have been thoroughly explored for the CS-3, its potential for sparse workloads has not been fully examined. Applications requiring sparse linear algebra kernels, such as GNNs, linear solvers, and recommendation systems, could achieve good performance on a dataflow accelerator like the CS-3. In this work, we explore two key sparse linear algebra kernels, sparse-dense matrix multiplication (SpMM) and sampled dense-dense matrix multiplication (SDDMM), on the Cerebras CS-3. We propose low-level CS-3 kernel designs for these operations and optimize our designs to improve I/O performance, memory footprint, and scalability to large matrices. Our evaluation examines memory footprint and SpMM/SDDMM speedup relative to CPU. The evaluation suggests that the CS-3 can outperform CPU by 100$\times$ for SpMM with 90\% sparse matrices with performance improving as sparse matrix dimensionality increases. SDDMM on CS-3 can outperform CPU 20$\times$ for 90\% sparse matrices. We additionally find that as sparsity increases to beyond 99\%, the CS-3 suffers from performance degradation that makes it slower than CPU for SpMM.

78.8DCApr 29
SplitFT: An Adaptive Federated Split Learning System For LLMs Fine-Tuning

Yimeng Shan, Zhaorui Zhang, Sheng Di et al.

Federated Split Learning has been identified as an efficient approach to address the computational resource constraints of clients in classical federated learning, while guaranteeing data privacy for distributed model training across data owners. However, it faces some critical challenges when such a training strategy meets large language models (LLMs) for fine-tuning. Such challenges include setting the cutlayer adaptively across different clients to address the data and device heterogeneity issues, which affect the system performance significantly. In addition, efficiently reducing the communication overhead during the fine-tuning procedure is also another challenge. No work tries to address these challenges. To bridge this gap, we propose SplitTF, an adaptive federated split learning system for LLMs fine-tuning. SplitFT enables different clients to set different cut layers according to their computation resources and trained model performance. SplitFT also proposes to reduce the LoRA rank in cutlayer to reduce the communication overhead. In addition to simulating the heterogeneous data in real-world applications for our proposed split federated learning system, we propose a length-based Dirichlet approach to divide the training data into different clients. Extensive experimental results show that our proposed approach outperforms the state-of-the-art approach for fine-tuning time efficiency and model performance based on various popular benchmarks.

LGMar 23, 2024
Understanding The Effectiveness of Lossy Compression in Machine Learning Training Sets

Robert Underwood, Jon C. Calhoun, Sheng Di et al.

Learning and Artificial Intelligence (ML/AI) techniques have become increasingly prevalent in high performance computing (HPC). However, these methods depend on vast volumes of floating point data for training and validation which need methods to share the data on a wide area network (WAN) or to transfer it from edge devices to data centers. Data compression can be a solution to these problems, but an in-depth understanding of how lossy compression affects model quality is needed. Prior work largely considers a single application or compression method. We designed a systematic methodology for evaluating data reduction techniques for ML/AI, and we use it to perform a very comprehensive evaluation with 17 data reduction methods on 7 ML/AI applications to show modern lossy compression methods can achieve a 50-100x compression ratio improvement for a 1% or less loss in quality. We identify critical insights that guide the future use and design of lossy compressors for ML/AI.

LGFeb 26, 2025
CLLoRA: An Approach to Measure the Effects of the Context Length for LLM Fine-Tuning

Ping Zhang, Zhaorui Zhang, Sheng Di et al.

Large language model fine-tuning has been identified as an efficient approach to applying the pre-trained Large language models to other domains. To guarantee data privacy for different data owners, models are often fine-tuned in federated learning environments across different data owners, which often involve data heterogeneity issues and affect the fine-tuning performance. In addition, the length of the context for the training data has been identified as a major factor that affects the LLM's model performance. To efficiently measure how the context length affects the LLM's model performance in heterogeneous federated learning environments, we propose CLLoRA. CLLoRA utilizes the parameter-efficient fine-tuning approach LoRA based on different kinds of LLMs with varying sizes as the fine-tuning approach to investigate whether the quality and length of contexts can serve as standards for measuring non-IID context. The findings indicate that an imbalance in context quality not only affects local training on clients but also impacts the global model's performance. However, context length has a minimal effect on local training but a more significant influence on the global model. These results provide insights into how context quality and length affect the model performance for LLM fine-tuning in federated learning environments.

CLAug 1, 2025
Systematic Evaluation of Optimization Techniques for Long-Context Language Models

Ammar Ahmed, Sheng Di, Franck Cappello et al.

Large language models (LLMs) excel across diverse natural language processing tasks but face resource demands and limited context windows. Although techniques like pruning, quantization, and token dropping can mitigate these issues, their efficacy in long-context scenarios and system evaluation remains underexplored. This paper systematically benchmarks these optimizations, characterizing memory usage, latency, and throughput, and studies how these methods impact the quality of text generation. We first analyze individual optimization methods for two LLM architectures supporting long context and then systematically evaluate combinations of these techniques to assess how this deeper analysis impacts performance metrics. We subsequently study the scalability of individual optimization methods on a larger variant with 70 billion-parameter model. Our novel insights reveal that naive combination inference optimization algorithms can adversely affect larger models due to compounded approximation errors, as compared to their smaller counterparts. Experiments show that relying solely on F1 obscures these effects by hiding precision-recall trade-offs in question answering tasks. By integrating system-level profiling with task-specific insights, this study helps LLM practitioners and researchers explore and balance efficiency, accuracy, and scalability across tasks and hardware configurations.

AISep 26, 2025
Retrieval-of-Thought: Efficient Reasoning via Reusing Thoughts

Ammar Ahmed, Azal Ahmad Khan, Ayaan Ahmad et al.

Large reasoning models improve accuracy by producing long reasoning traces, but this inflates latency and cost, motivating inference-time efficiency. We propose Retrieval-of-Thought (RoT), which reuses prior reasoning as composable ``thought" steps to guide new problems. RoT organizes steps into a thought graph with sequential and semantic edges to enable fast retrieval and flexible recombination. At inference, RoT retrieves query-relevant nodes and applies reward-guided traversal to assemble a problem-specific template that guides generation. This dynamic template reuse reduces redundant exploration and, therefore, reduces output tokens while preserving accuracy. We evaluate RoT on reasoning benchmarks with multiple models, measuring accuracy, token usage, latency, and memory overhead. Findings show small prompt growth but substantial efficiency gains, with RoT reducing output tokens by up to 40%, inference latency by 82%, and cost by 59% while maintaining accuracy. RoT establishes a scalable paradigm for efficient LRM reasoning via dynamic template construction through retrieval.

LGSep 9, 2025
MoE-Compression: How the Compression Error of Experts Affects the Inference Accuracy of MoE Model?

Songkai Ma, Zhaorui Zhang, Sheng Di et al.

With the widespread application of Mixture of Experts (MoE) reasoning models in the field of LLM learning, efficiently serving MoE models under limited GPU memory constraints has emerged as a significant challenge. Offloading the non-activated experts to main memory has been identified as an efficient approach to address such a problem, while it brings the challenges of transferring the expert between the GPU memory and main memory. We need to explore an efficient approach to compress the expert and analyze how the compression error affects the inference performance. To bridge this gap, we propose employing error-bounded lossy compression algorithms (such as SZ3 and CuSZp) to compress non-activated experts, thereby reducing data transfer overhead during MoE inference. We conduct extensive experiments across various benchmarks and present a comprehensive analysis of how compression-induced errors in different experts affect overall inference accuracy. The results indicate that experts in the shallow layers, which are primarily responsible for the attention mechanism and the transformation of input tokens into vector representations, exhibit minimal degradation in inference accuracy when subjected to bounded errors. In contrast, errors in the middle-layer experts, which are central to model reasoning, significantly impair inference accuracy. Interestingly, introducing bounded errors in the deep-layer experts, which are mainly responsible for instruction following and output integration, can sometimes lead to improvements in inference accuracy.

LGMay 25, 2021
Exploring Autoencoder-based Error-bounded Compression for Scientific Data

Jinyang Liu, Sheng Di, Kai Zhao et al.

Error-bounded lossy compression is becoming an indispensable technique for the success of today's scientific projects with vast volumes of data produced during simulations or instrument data acquisitions. Not only can it significantly reduce data size, but it also can control the compression errors based on user-specified error bounds. Autoencoder (AE) models have been widely used in image compression, but few AE-based compression approaches support error-bounding features, which are highly required by scientific applications. To address this issue, we explore using convolutional autoencoders to improve error-bounded lossy compression for scientific data, with the following three key contributions. (1) We provide an in-depth investigation of the characteristics of various autoencoder models and develop an error-bounded autoencoder-based framework in terms of the SZ model. (2) We optimize the compression quality for the main stages in our designed AE-based error-bounded compression framework, fine-tuning the block sizes and latent sizes and also optimizing the compression efficiency of latent vectors. (3) We evaluate our proposed solution using five real-world scientific datasets and compare them with six other related works. Experiments show that our solution exhibits a very competitive compression quality among all the compressors in our tests. In absolute terms, it can obtain a much better compression quality (100% ~ 800% improvement in compression ratio with the same data distortion) compared with SZ2.1 and ZFP in cases with a high compression ratio.

DCMar 27, 2020
FT-CNN: Algorithm-Based Fault Tolerance for Convolutional Neural Networks

Kai Zhao, Sheng Di, Sihuan Li et al.

Convolutional neural networks (CNNs) are becoming more and more important for solving challenging and critical problems in many fields. CNN inference applications have been deployed in safety-critical systems, which may suffer from soft errors caused by high-energy particles, high temperature, or abnormal voltage. Of critical importance is ensuring the stability of the CNN inference process against soft errors. Traditional fault tolerance methods are not suitable for CNN inference because error-correcting code is unable to protect computational components, instruction duplication techniques incur high overhead, and existing algorithm-based fault tolerance (ABFT) techniques cannot protect all convolution implementations. In this paper, we focus on how to protect the CNN inference process against soft errors as efficiently as possible, with the following three contributions. (1) We propose several systematic ABFT schemes based on checksum techniques and analyze their fault protection ability and runtime thoroughly.Unlike traditional ABFT based on matrix-matrix multiplication, our schemes support any convolution implementations. (2) We design a novel workflow integrating all the proposed schemes to obtain a high detection/correction ability with limited total runtime overhead. (3) We perform our evaluation using ImageNet with well-known CNN models including AlexNet, VGG-19, ResNet-18, and YOLOv2. Experimental results demonstrate that our implementation can handle soft errors with very limited runtime overhead (4%~8% in both error-free and error-injected situations).

CVJan 26, 2019
DeepSZ: A Novel Framework to Compress Deep Neural Networks by Using Error-Bounded Lossy Compression

Sian Jin, Sheng Di, Xin Liang et al.

DNNs have been quickly and broadly exploited to improve the data analysis quality in many complex science and engineering applications. Today's DNNs are becoming deeper and wider because of increasing demand on the analysis quality and more and more complex applications to resolve. The wide and deep DNNs, however, require large amounts of resources, significantly restricting their utilization on resource-constrained systems. Although some network simplification methods have been proposed to address this issue, they suffer from either low compression ratios or high compression errors, which may introduce a costly retraining process for the target accuracy. In this paper, we propose DeepSZ: an accuracy-loss bounded neural network compression framework, which involves four key steps: network pruning, error bound assessment, optimization for error bound configuration, and compressed model generation, featuring a high compression ratio and low encoding time. The contribution is three-fold. (1) We develop an adaptive approach to select the feasible error bounds for each layer. (2) We build a model to estimate the overall loss of accuracy based on the accuracy degradation caused by individual decompressed layers. (3) We develop an efficient optimization algorithm to determine the best-fit configuration of error bounds in order to maximize the compression ratio under the user-set accuracy constraint. Experiments show that DeepSZ can compress AlexNet and VGG-16 on the ImageNet by a compression ratio of 46X and 116X, respectively, and compress LeNet-300-100 and LeNet-5 on the MNIST by a compression ratio of 57X and 56X, respectively, with only up to 0.3% loss of accuracy. Compared with other state-of-the-art methods, DeepSZ can improve the compression ratio by up to 1.43X, the DNN encoding performance by up to 4.0X (with four Nvidia Tesla V100 GPUs), and the decoding performance by up to 6.2X.