Shaowei Zhu

h-index69

9papers

37citations

Novelty56%

AI Score55

Ranked #24,013 of 201,326 authors (top 12%)#10,173 in CV (top 17%)

9 Papers

DCJun 3

Ekka: Automated Diagnosis of Silent Errors in LLM Inference

Yile Gu, Zhen Zhang, Shaowei Zhu et al.

LLM serving frameworks are quickly evolving with a complex software stack and a vast number of optimizations. The rapid development process can introduce silent errors where output quality silently degrades without any explicit error signals. Diagnosing silent errors is notoriously difficult due to the substantial semantic gap between the high-level symptoms and the low-level root causes. We observe that diagnosis of silent errors can be effectively framed as a differential debugging problem by leveraging the existence of semantically correct reference implementations. We propose Ekka, an automated diagnosis system that identifies root causes by systematically aligning and comparing intermediate execution states between a target and a reference framework. We constructed a benchmark of real-world silent errors from popular serving frameworks, where Ekka shows 80% pass@1 diagnosis accuracy and 88% pass@5 diagnosis accuracy, outperforming state-of-the-art systems. Ekka also diagnoses 4 new silent errors from serving frameworks, all of which have been confirmed by the developers.

CVDec 16, 2022

Adversarial Example Defense via Perturbation Grading Strategy

Shaowei Zhu, Wanli Lyu, Bin Li et al.

Deep Neural Networks have been widely used in many fields. However, studies have shown that DNNs are easily attacked by adversarial examples, which have tiny perturbations and greatly mislead the correct judgment of DNNs. Furthermore, even if malicious attackers cannot obtain all the underlying model parameters, they can use adversarial examples to attack various DNN-based task systems. Researchers have proposed various defense methods to protect DNNs, such as reducing the aggressiveness of adversarial examples by preprocessing or improving the robustness of the model by adding modules. However, some defense methods are only effective for small-scale examples or small perturbations but have limited defense effects for adversarial examples with large perturbations. This paper assigns different defense strategies to adversarial perturbations of different strengths by grading the perturbations on the input examples. Experimental results show that the proposed method effectively improves defense performance. In addition, the proposed method does not modify any task model, which can be used as a preprocessing module, which significantly reduces the deployment cost in practical applications.

LGNov 19, 2025Code

AccelOpt: A Self-Improving LLM Agentic System for AI Accelerator Kernel Optimization

Genghan Zhang, Shaowei Zhu, Anjiang Wei et al. · stanford

We present AccelOpt, a self-improving large language model (LLM) agentic system that autonomously optimizes kernels for emerging AI acclerators, eliminating the need for expert-provided hardware-specific optimization knowledge. AccelOpt explores the kernel optimization space through iterative generation, informed by an optimization memory that curates experiences and insights from previously encountered slow-fast kernel pairs. We build NKIBench, a new benchmark suite of AWS Trainium accelerator kernels with varying complexity extracted from real-world LLM workloads to evaluate the effectiveness of AccelOpt. Our evaluation confirms that AccelOpt's capability improves over time, boosting the average percentage of peak throughput from $49\%$ to $61\%$ on Trainium 1 and from $45\%$ to $59\%$ on Trainium 2 for NKIBench kernels. Moreover, AccelOpt is highly cost-effective: using open-source models, it matches the kernel improvements of Claude Sonnet 4 while being $26\times$ cheaper.

PLApr 13

A Categorical Basis for Robust Program Analysis

Zachary Kincaid, Shaowei Zhu

Users of program analyses expect that results change predictably in response to changes in their programs, but many analyses fail to provide such robustness. This paper introduces a theoretical framework that provides a unified language to articulate robustness properties. By modeling programs and their properties as objects in a category, diverse notions of robustness-from variable renaming to semantic refinement and structural transformation-can be characterized as structure-preserving functors. Beyond formulating the meaning of robustness, this paper provides methods for achieving it. The first is a general recipe for designing robust analyses, by lifting a sound and robust analysis from a restricted (sub-Turing) model of computation to a sound and robust analysis for general programs. This recipe demystifies the design of several existing loop summarization and termination analyses by showing they are instantiations of this general recipe, and furthermore elucidates their robustness properties. The second is a characterization of a sense in which an algebraic program analysis is robust, provided that it is comprised of robust operators. In particular, we show that such analyses behave predictably under common refactoring patterns, such as variable renaming and loop unrolling.

SEOct 18, 2025

TritonRL: Training LLMs to Think and Code Triton Without Cheating

Jiin Woo, Shaowei Zhu, Allen Nie et al.

With the rapid evolution of large language models (LLMs), the demand for automated, high-performance system kernels has emerged as a key enabler for accelerating development and deployment. We introduce TritonRL, a domain-specialized LLM for Triton kernel generation, trained with a novel training framework that enables robust and automated kernel synthesis. Unlike general-purpose programming languages, Triton kernel generation faces unique challenges due to data scarcity and incomplete evaluation criteria, vulnerable to reward hacking. Our approach addresses these challenges end-to-end by distilling Triton-specific knowledge through supervised fine-tuning on curated datasets, and further improving code quality via reinforcement learning (RL) with robust, verifiable rewards and hierarchical reward assignment. Our RL framework robustly detects reward hacking and guides both reasoning traces and code tokens through fine-grained verification and hierarchical reward decomposition, enabling the model to generate high-quality Triton kernels that can truly replace existing modules. With robust and fine-grained evaluation, our experiments on KernelBench demonstrate that TritonRL achieves state-of-the-art correctness and speedup, surpassing all other Triton-specific models and underscoring the effectiveness of our RL-based training paradigm.

DCJun 10, 2025

TTrace: Lightweight Error Checking and Diagnosis for Distributed Training

Haitian Jiang, Shaowei Zhu, Zhen Zhang et al.

Distributed training is essential for scaling the training of large neural network models, such as large language models (LLMs), across thousands of GPUs. However, the complexity of distributed training programs makes them particularly prone to silent bugs, which do not produce explicit error signal but lead to incorrect training outcome. Effectively detecting and localizing such silent bugs in distributed training is challenging. Common debugging practice using metrics like training loss or gradient norm curves can be inefficient and ineffective. Additionally, obtaining intermediate tensor values and determining whether they are correct during silent bug localization is difficult, particularly in the context of low-precision training. To address those challenges, we design and implement TTrace, the first system capable of detecting and localizing silent bugs in distributed training. TTrace collects intermediate tensors from distributing training in a fine-grained manner and compares them against those from a trusted single-device reference implementation. To properly compare the floating-point values in the tensors, we propose novel mathematical analysis that provides a guideline for setting thresholds, enabling TTrace to distinguish bug-induced errors from floating-point round-off errors. Experimental results demonstrate that TTrace effectively detects 11 existing bugs and 3 new bugs in the widely used Megatron-LM framework, while requiring fewer than 10 lines of code change. TTrace is effective in various training recipes, including low-precision recipes involving BF16 and FP8.

LGSep 12, 2025

Verifying Computational Graphs in Production-Grade Distributed Machine Learning Frameworks

Kahfi S. Zulkifli, Wenbo Qian, Shaowei Zhu et al.

Modern machine learning frameworks support very large models by incorporating parallelism and optimization techniques. Yet, these very techniques add new layers of complexity, introducing silent errors that severely degrade model performance. Existing solutions are either ad hoc or too costly for production. We present Scalify, a lightweight framework that exposes silent errors by verifying semantic equivalence of computational graphs using equality saturation and Datalog-style reasoning. To scale, Scalify partitions graphs with parallel rewriting and layer memoization, reuses rewrite templates, and augments equality saturation with relational reasoning and symbolic bijection inference. It further localizes discrepancies to precise code sites, turning verification results into actionable debugging guidance. Scalify verifies models as large as Llama-3.1-405B within minutes on a commodity machine and exposed five unknown bugs in Amazon production machine learning frameworks.

CVMay 8, 2023

Adversarial Examples Detection with Enhanced Image Difference Features based on Local Histogram Equalization

Zhaoxia Yin, Shaowei Zhu, Hang Su et al.

Deep Neural Networks (DNNs) have recently made significant progress in many fields. However, studies have shown that DNNs are vulnerable to adversarial examples, where imperceptible perturbations can greatly mislead DNNs even if the full underlying model parameters are not accessible. Various defense methods have been proposed, such as feature compression and gradient masking. However, numerous studies have proven that previous methods create detection or defense against certain attacks, which renders the method ineffective in the face of the latest unknown attack methods. The invisibility of adversarial perturbations is one of the evaluation indicators for adversarial example attacks, which also means that the difference in the local correlation of high-frequency information in adversarial examples and normal examples can be used as an effective feature to distinguish the two. Therefore, we propose an adversarial example detection framework based on a high-frequency information enhancement strategy, which can effectively extract and amplify the feature differences between adversarial examples and normal examples. Experimental results show that the feature augmentation module can be combined with existing detection models in a modular way under this framework. Improve the detector's performance and reduce the deployment cost without modifying the existing detection model.

CVOct 6, 2021

Reversible Attack based on Local Visual Adversarial Perturbation

Li Chen, Shaowei Zhu, Zhaoxia Yin

Adding perturbations to images can mislead classification models to produce incorrect results. Recently, researchers exploited adversarial perturbations to protect image privacy from retrieval by intelligent models. However, adding adversarial perturbations to images destroys the original data, making images useless in digital forensics and other fields. To prevent illegal or unauthorized access to sensitive image data such as human faces without impeding legitimate users, the use of reversible adversarial attack techniques is increasing. The original image can be recovered from its reversible adversarial examples. However, existing reversible adversarial attack methods are designed for traditional imperceptible adversarial perturbations and ignore the local visible adversarial perturbation. In this paper, we propose a new method for generating reversible adversarial examples based on local visible adversarial perturbation. The information needed for image recovery is embedded into the area beyond the adversarial patch by the reversible data hiding technique. To reduce image distortion, lossless compression and the B-R-G (bluered-green) embedding principle are adopted. Experiments on CIFAR-10 and ImageNet datasets show that the proposed method can restore the original images error-free while ensuring good attack performance.