May 5, 2021

Dynamic Reliability Management in Neuromorphic Computing

arXiv:2105.02038v1, 11 citations
Originality: Incremental advance
AI Analysis

This paper addresses reliability challenges in neuromorphic hardware used for machine learning, offering a more efficient alternative to existing fixed-interval de-stressing methods. It appears incremental because it builds on prior reliability-oriented techniques.

The paper tackles the problem of aging-related reliability issues in neuromorphic computing systems caused by elevated voltages and currents in non-volatile memory, which degrade CMOS transistors and impact performance. It proposes a dynamic run-time manager (NCRTM) that intelligently schedules de-stress operations to meet reliability targets, resulting in significant reliability improvements with marginal performance impact.

Neuromorphic computing systems use non-volatile memory (NVM) to implement high-density and low-energy synaptic storage. Elevated voltages and currents needed to operate NVMs cause aging of CMOS-based transistors in each neuron and synapse circuit in the hardware, drifting the transistors' parameters from their nominal values. Aggressive device scaling increases power density and temperature, which accelerates the aging, challenging the reliable operation of neuromorphic systems. Existing reliability-oriented techniques periodically de-stress all neuron and synapse circuits in the hardware at fixed intervals, assuming worst-case operating conditions, without actually tracking their aging at run time. To de-stress these circuits, normal operation must be interrupted, which introduces latency in spike generation and propagation, impacting the inter-spike interval and hence performance, e.g., accuracy. We propose a new architectural technique to mitigate the aging-related reliability problems in neuromorphic systems, by designing an intelligent run-time manager (NCRTM), which dynamically de-stresses neuron and synapse circuits in response to the short-term aging in their CMOS transistors during the execution of machine learning workloads, with the objective of meeting a reliability target. NCRTM de-stresses these circuits only when it is absolutely necessary to do so, otherwise reducing the performance impact by scheduling de-stress operations off the critical path. We evaluate NCRTM with state-of-the-art machine learning workloads on neuromorphic hardware. Our results demonstrate that NCRTM significantly improves the reliability of neuromorphic hardware, with marginal impact on performance.
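The contrast the abstract draws between fixed-interval and aging-driven de-stressing can be illustrated with a toy simulation. This is a minimal sketch and not the paper's NCRTM. The linear aging model, the full recovery on de-stress, and every name and number (`Circuit`, `stress_rate`, `run_fixed_interval`, `run_dynamic`, the threshold of 8.0) are assumptions made up for illustration. It also does not model scheduling de-stress operations off the critical path.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Circuit:
    """Hypothetical neuron/synapse circuit with a linear aging model."""
    stress_rate: float   # assumed aging accumulated per time step
    aging: float = 0.0   # assumed short-term aging since the last de-stress


def run_fixed_interval(rates: List[float], steps: int, interval: int) -> Tuple[int, float]:
    """Baseline: de-stress every circuit every `interval` steps, without tracking aging."""
    circuits = [Circuit(r) for r in rates]
    interruptions = 0
    peak = 0.0
    for t in range(1, steps + 1):
        for c in circuits:
            c.aging += c.stress_rate
            peak = max(peak, c.aging)
        if t % interval == 0:
            for c in circuits:
                c.aging = 0.0          # assumption: de-stress fully recovers the circuit
            interruptions += len(circuits)
    return interruptions, peak


def run_dynamic(rates: List[float], steps: int, threshold: float) -> Tuple[int, float]:
    """Aging-driven: de-stress a circuit only when its tracked aging reaches `threshold`."""
    circuits = [Circuit(r) for r in rates]
    interruptions = 0
    peak = 0.0
    for _ in range(steps):
        for c in circuits:
            c.aging += c.stress_rate
            peak = max(peak, c.aging)
            if c.aging >= threshold:
                c.aging = 0.0          # assumption: de-stress fully recovers the circuit
                interruptions += 1
    return interruptions, peak


if __name__ == "__main__":
    rates = [1.0, 0.25, 0.125]         # made-up per-circuit stress rates
    # The fixed interval is sized for the worst-case rate (threshold 8.0 / rate 1.0).
    print("fixed  :", run_fixed_interval(rates, steps=64, interval=8))
    print("dynamic:", run_dynamic(rates, steps=64, threshold=8.0))
```

With these made-up inputs, both policies keep peak aging at 8.0, but the fixed-interval baseline interrupts circuits 24 times while the aging-driven policy interrupts them 11 times, because lightly stressed circuits are left alone until they actually approach the limit.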

