Strix: Re-thinking NPU Reliability from a System Perspective
For safety-critical systems that rely on DNN/LLM accelerators, Strix bridges the gap between reliability requirements and deployable solutions at low overhead.
Strix introduces a full-stack NPU reliability framework that re-partitions the NPU along the inference pipeline, achieving sub-microsecond fault localization, detection, and correction with only 1.04x slowdown and minimal hardware overhead.
DNNs and LLMs increasingly rely on hardware accelerators, including in safety-critical domains, while technology scaling and growing model complexity make hardware faults more frequent. Existing system-level mechanisms typically treat the NPU as a monolithic unit and use coarse-grained replication, which incurs prohibitive performance and hardware overheads and leaves a gap between reliability requirements and deployable solutions. To bridge this gap, we present Strix, a full-stack NPU reliability framework on an open-source SoC, spanning micro-architecture, ISA, and programming methods. Strix re-partitions the NPU along the system inference pipeline, identifies dominant failure modes, and attaches targeted safeguards, achieving sub-microsecond fault localization, error detection, and correction with only 1.04$\times$ slowdown and minimal hardware overhead.