AR · Apr 12

Strix: Re-thinking NPU Reliability from a System Perspective

Jiapeng Guan, Jie Zhang, Hao Zhou, Ran Wei, Dean You, Hui Wang, Yingquan Wang, Tinglue Wang, Xudong Zhao, Jing Li, Zhe Jiang

arXiv:2604.10484 · 93.8 · h-index: 6 · Has Code

AI Analysis

For safety-critical systems using DNN/LLM accelerators, Strix bridges the gap between reliability requirements and deployable solutions with low overhead.

Strix introduces a full-stack NPU reliability framework that re-partitions the NPU along the inference pipeline, achieving sub-microsecond fault localization, detection, and correction with only 1.04x slowdown and minimal hardware overhead.

DNNs and LLMs increasingly rely on hardware accelerators, including in safety-critical domains, while technology scaling and growing model complexity make hardware faults more frequent. Existing system-level mechanisms typically treat the NPU as a monolithic unit and use coarse-grained replication, which incurs prohibitive performance and hardware overheads and leaves a gap between reliability requirements and deployable solutions. To bridge this gap, we present Strix, a full-stack NPU reliability framework on an open-source SoC, spanning micro-architecture, ISA, and programming methods. Strix re-partitions the NPU along the system inference pipeline, identifies dominant failure modes, and attaches targeted safeguards, achieving sub-microsecond fault localisation, error detection, and correction with only a 1.04× slowdown and minimal hardware overhead.
