Periodic Online Testing for Sparse Systolic Tensor Arrays
This work addresses reliability in safety-critical systems that rely on ML hardware, but the contribution is incremental: it builds on existing testing methods for systolic arrays.
The paper tackles the problem of ensuring reliability in sparse systolic tensor arrays, which accelerate structured-sparse ML models. It introduces an online error-checking technique that detects and locates permanent faults before computation begins, achieving very high fault coverage with minimal performance and area overheads.
Modern Machine Learning (ML) applications often benefit from structured sparsity, a technique that efficiently reduces model complexity and simplifies handling of sparse data in hardware. Sparse systolic tensor arrays, specifically designed to accelerate these structured-sparse ML models, play a pivotal role in enabling efficient computations. As ML is increasingly integrated into safety-critical systems, ensuring the reliability of these systems is of paramount importance. This paper introduces an online error-checking technique capable of detecting and locating permanent faults within sparse systolic tensor arrays before computation begins. The new technique relies on only four test vectors and exploits the weight values already loaded within the systolic array to comprehensively test the system. Fault-injection campaigns within the gate-level netlist, while executing three well-established Convolutional Neural Networks (CNNs), validate the efficiency of the proposed approach, which is shown to achieve very high fault coverage while incurring minimal performance and area overheads.
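The general idea of testing with the weights already in the array can be illustrated with a small behavioral model. The sketch below is not the paper's method. The `SparsePE` class, the `TEST_PATTERNS` values, the block width of 4, the output stuck-at fault model, and the direct per-element observation of results are all assumptions made for illustration; in particular, the patterns are not the four test vectors the paper proposes. Each processing element holds the nonzero weights of one block together with their indices, and the check drives known input patterns through it and compares the result with a fault-free reference computed from the same weights.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

BLOCK = 4  # activations per sparse block (assumed width)

# Hypothetical 8-bit input patterns; NOT the test vectors from the paper.
TEST_PATTERNS = [0x00, 0xFF, 0x55, 0xAA]


@dataclass
class SparsePE:
    """Behavioral model of one processing element (assumed structure)."""
    weights: List[Tuple[int, int]]               # (value, index into the block)
    stuck_bit: Optional[Tuple[int, int]] = None  # (bit position, stuck level)

    def compute(self, block: List[int]) -> int:
        acc = 0
        for value, idx in self.weights:
            acc += value * block[idx]  # the index selects the matching activation
        if self.stuck_bit is not None:
            # Model a permanent stuck-at fault on one output bit.
            bit, level = self.stuck_bit
            acc = acc | (1 << bit) if level else acc & ~(1 << bit)
        return acc


def golden(weights: List[Tuple[int, int]], block: List[int]) -> int:
    """Fault-free reference result for the same weights and inputs."""
    return sum(value * block[idx] for value, idx in weights)


def locate_faults(array: List[List[SparsePE]]) -> List[Tuple[int, int]]:
    """Return (row, col) of every element whose output deviates on any pattern."""
    faulty = []
    for r, row in enumerate(array):
        for c, pe in enumerate(row):
            for pattern in TEST_PATTERNS:
                block = [pattern] * BLOCK
                if pe.compute(block) != golden(pe.weights, block):
                    faulty.append((r, c))
                    break  # one mismatch is enough to flag this element
    return faulty
```

Because the reference is derived from the weights the element already holds, no dedicated test weights need to be loaded. A real array would observe accumulated column outputs instead of individual elements, so locating a fault there requires more care than this model shows.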