Zizhen Liu

AR
6papers
26citations
Novelty53%
AI Score53

6 Papers

74.3ARMay 27
FT-Pilot: Automated Fault-Tolerant RTL Rewriting via Vulnerability-Guided LLMs

Weixing Liu, Zizhen Liu, Jing Ye et al.

As integrated circuit technologies continue to scale toward advanced process nodes, the continual reduction in node capacitance and supply voltage has made digital systems increasingly vulnerable to soft errors. Although traditional full-chip hardening methods can improve reliability, they often incur unacceptable area and power overhead, making selective hardening a more practical engineering solution. However, existing approaches typically rely on time-consuming fault-injection simulation to determine hardening locations through vulnerability analysis, and still depend heavily on manual strategy selection and RTL modification during the hardening stage, making them ill-suited for efficient automated reliability optimization at early design stages. To address these challenges, this paper proposes FT-Pilot, a GNN-guided LLM framework for automatic RTL soft-error hardening. The framework first employs a GNN to identify critical vulnerable assets directly at the RTL level, and then introduces an LLM-driven rewriting engine composed of an analyzer and a rewriter, which performs RTL-level fault-tolerant code rewriting with the support of dual-knowledge-base retrieval-augmented generation and an automatic repair mechanism. Experimental results show that the proposed framework can automatically generate hardened RTL designs that are syntactically correct, functionally correct, and synthesizable across multiple benchmark circuits, while significantly reducing output error rates under soft-error scenarios. This work provides a practical automated path toward shift-left reliability optimization at the RTL level.

ARDec 25, 2025
Analysis of LLM Vulnerability to GPU Soft Errors: An Instruction-Level Fault Injection Study

Duo Chai, Zizhen Liu, Shuhuai Wang et al.

Large language models (LLMs) are highly compute- and memory-intensive, posing significant demands on high-performance GPUs. At the same time, advances in GPU technology driven by shrinking transistor sizes and lower operating voltages have made these devices increasingly susceptible to soft errors. While prior work has examined GPU reliability, most studies have focused on general-purpose applications or conventional neural networks mostly used for vision tasks such as classification and detection. In contrast, systematic analysis of modern large-scale LLMs remains limited, despite their rapid adoption in diverse application scenarios. Given the unique characteristics of LLMs, their resilience to soft errors may differ substantially from earlier models. To bridge this gap, we conduct the first instruction-level fault injection study of LLM inference. Our approach reveals reliability characteristics from multiple perspectives, highlighting the effects of model architecture, parameter scale, and task complexity. These findings provide new insights into LLM reliability and inform the design of more effective fault tolerance mechanisms.

24.9SEMay 17
Debug Like a Human: Scaling LLM-based Fault Localization to Processor Design via Block-Level Instruction-Oriented Slicing

Zizhen Liu, Xiaoguang Mao, Deheng Yang et al.

Fault localization in modern processor design code is a critical yet time-consuming step during processor verification. While recent advances in LLM-based techniques for module-level hardware design have shown promising results, automatically localizing bugs in large-scale, project-level processor designs remains challenging. In this paper, we present BluesFL, a novel block-level LLM-based fault localization framework for processor designs. Inspired by the way engineers debug processors, we first propose a dataflow-based code blockization approach to guide LLMs to focus on critical local code context. We further propose a Block-Level Instruction-Oriented Slicing (Blues) algorithm that enables LLMs to mimic human reasoning by analyzing instruction execution paths and processor states. We evaluate BluesFL on a real-world RISC-V processor core comprising 19K lines of SystemVerilog code. Experimental results demonstrate that BluesFL correctly localizes 24 bugs at Top-1, achieving 242.9% improvement over the existing state-of-the-art (7 bugs). Cost analysis shows that BluesFL requires an average of only $0.257 to localize a single bug.

31.5CRMar 24
On the Vulnerability of FHE Computation to Silent Data Corruption

Jianan Mu, Ge Yu, Zhaoxuan Kan et al.

Fully Homomorphic Encryption (FHE) is rapidly emerging as a promising foundation for privacy-preserving cloud services, enabling computation directly on encrypted data. As FHE implementations mature and begin moving toward practical deployment in domains such as secure finance, biomedical analytics, and privacy-preserving AI, a critical question remains insufficiently explored: how reliable is FHE computation on real hardware? This question is especially important because, compared with plaintext computation, FHE incurs much higher computational overhead, making it more susceptible to transient hardware faults. Moreover, data corruptions are likely to remain silent: the FHE service has no access to the underlying plaintext, causing unawareness even though the corresponding decrypted result has already been corrupted. To this end, we conduct a comprehensive evaluation of SDCs in FHE ciphertext computation. Through large-scale fault-injection experiments, we characterize the vulnerability of FHE to transient faults, and through a theoretical analysis of error-propagation behaviors, we gain deeper algorithmic insight into the mechanisms underlying this vulnerability. We further assess the effectiveness of different fault-tolerance mechanisms for mitigating these faults.

ARNov 25, 2025
InF-ATPG: Intelligent FFR-Driven ATPG with Advanced Circuit Representation Guided Reinforcement Learning

Bin Sun, Rengang Zhang, Zhiteng Chao et al.

Automatic test pattern generation (ATPG) is a crucial process in integrated circuit (IC) design and testing, responsible for efficiently generating test patterns. As semiconductor technology progresses, traditional ATPG struggles with long execution times to achieve the expected fault coverage, which impacts the time-to-market of chips. Recent machine learning techniques, like reinforcement learning (RL) and graph neural networks (GNNs), show promise but face issues such as reward delay in RL models and inadequate circuit representation in GNN-based methods. In this paper, we propose InF-ATPG, an intelligent FFR-driven ATPG framework that overcomes these challenges by using advanced circuit representation to guide RL. By partitioning circuits into fanout-free regions (FFRs) and incorporating ATPG-specific features into a novel QGNN architecture, InF-ATPG enhances test pattern generation efficiency. Experimental results show InF-ATPG reduces backtracks by 55.06\% on average compared to traditional methods and 38.31\% compared to the machine learning approach, while also improving fault coverage.

CRNov 24, 2021
SASH: Efficient Secure Aggregation Based on SHPRG For Federated Learning

Zizhen Liu, Si Chen, Jing Ye et al.

To prevent private training data leakage in Fed?erated Learning systems, we propose a novel se?cure aggregation scheme based on seed homomor?phic pseudo-random generator (SHPRG), named SASH. SASH leverages the homomorphic property of SHPRG to simplify the masking and demask?ing scheme, which for each of the clients and for the server, entails an overhead linear w.r.t model size and constant w.r.t number of clients. We prove that even against worst-case colluding adversaries, SASH preserves training data privacy, while being resilient to dropouts without extra overhead. We experimentally demonstrate SASH significantly improves the efficiency to 20 times over baseline, especially in the more realistic case where the numbers of clients and model size become large, and a cer?tain percentage of clients drop out from the system.