Sumit Bam Shrestha

NE
h-index34
13papers
1,476citations
Novelty50%
AI Score41

13 Papers

AIApr 10, 2023
NeuroBench: A Framework for Benchmarking Neuromorphic Computing Algorithms and Systems

Jason Yik, Korneel Van den Berghe, Douwe den Blanken et al. · eth-zurich

Neuromorphic computing shows promise for advancing computing efficiency and capabilities of AI applications using brain-inspired principles. However, the neuromorphic research field currently lacks standardized benchmarks, making it difficult to accurately measure technological advancements, compare performance with conventional methods, and identify promising future research directions. Prior neuromorphic computing benchmark efforts have not seen widespread adoption due to a lack of inclusive, actionable, and iterative benchmark design and guidelines. To address these shortcomings, we present NeuroBench: a benchmark framework for neuromorphic computing algorithms and systems. NeuroBench is a collaboratively-designed effort from an open community of researchers across industry and academia, aiming to provide a representative structure for standardizing the evaluation of neuromorphic approaches. The NeuroBench framework introduces a common set of tools and systematic methodology for inclusive benchmark measurement, delivering an objective reference framework for quantifying neuromorphic approaches in both hardware-independent (algorithm track) and hardware-dependent (system track) settings. In this article, we outline tasks and guidelines for benchmarks across multiple application domains, and present initial performance baselines across neuromorphic and conventional approaches for both benchmark tracks. NeuroBench is intended to continually expand its benchmarks and features to foster and track the progress made by the research community.

LGSep 23, 2024
A Diagonal Structured State Space Model on Loihi 2 for Efficient Streaming Sequence Processing

Svea Marie Meyer, Philipp Weidel, Philipp Plank et al.

Deep State-Space Models (SSM) demonstrate state-of-the art performance on long-range sequence modeling tasks. While the recurrent structure of SSMs can be efficiently implemented as a convolution or as a parallel scan during training, recurrent token-by-token processing cannot currently be implemented efficiently on GPUs. Here, we demonstrate efficient token-by-token inference of the SSM S4D on Intel's Loihi 2 state-of-the-art neuromorphic processor. We compare this first ever neuromorphic-hardware implementation of an SSM on sMNIST, psMNIST, and sCIFAR to a recurrent and a convolutional implementation of S4D on Jetson Orin Nano (Jetson). While we find Jetson to perform better in an offline sample-by-sample based batched processing mode, Loihi 2 outperforms during token-by-token based processing, where it consumes 1000 times less energy with a 75 times lower latency and a 75 times higher throughput compared to the recurrent implementation of S4D on Jetson. This opens up new avenues towards efficient real-time streaming applications of SSMs.

NEAug 28, 2024
Emulating Brain-like Rapid Learning in Neuromorphic Edge Computing

Kenneth Stewart, Michael Neumeier, Sumit Bam Shrestha et al.

Achieving personalized intelligence at the edge with real-time learning capabilities holds enormous promise in enhancing our daily experiences and helping decision making, planning, and sensing. However, efficient and reliable edge learning remains difficult with current technology due to the lack of personalized data, insufficient hardware capabilities, and inherent challenges posed by online learning. Over time and across multiple developmental stages, the brain has evolved to efficiently incorporate new knowledge by gradually building on previous knowledge. In this work, we emulate the multiple stages of learning with digital neuromorphic technology that simulates the neural and synaptic processes of the brain using two stages of learning. First, a meta-training stage trains the hyperparameters of synaptic plasticity for one-shot learning using a differentiable simulation of the neuromorphic hardware. This meta-training process refines a hardware local three-factor synaptic plasticity rule and its associated hyperparameters to align with the trained task domain. In a subsequent deployment stage, these optimized hyperparameters enable fast, data-efficient, and accurate learning of new classes. We demonstrate our approach using event-driven vision sensor data and the Intel Loihi neuromorphic processor with its plasticity dynamics, achieving real-time one-shot learning of new classes that is vastly improved over transfer learning. Our methodology can be deployed with arbitrary plasticity models and can be applied to situations demanding quick learning and adaptation at the edge, such as navigating unfamiliar environments or learning unexpected categories of data through user engagement.

RODec 3, 2025
Autonomous Reinforcement Learning Robot Control with Intel's Loihi 2 Neuromorphic Hardware

Kenneth Stewart, Roxana Leontie, Samantha Chapin et al.

We present an end-to-end pipeline for deploying reinforcement learning (RL) trained Artificial Neural Networks (ANNs) on neuromorphic hardware by converting them into spiking Sigma-Delta Neural Networks (SDNNs). We demonstrate that an ANN policy trained entirely in simulation can be transformed into an SDNN compatible with Intel's Loihi 2 architecture, enabling low-latency and energy-efficient inference. As a test case, we use an RL policy for controlling the Astrobee free-flying robot, similar to a previously hardware in space-validated controller. The policy, trained with Rectified Linear Units (ReLUs), is converted to an SDNN and deployed on Intel's Loihi 2, then evaluated in NVIDIA's Omniverse Isaac Lab simulation environment for closed-loop control of Astrobee's motion. We compare execution performance between GPU and Loihi 2. The results highlight the feasibility of using neuromorphic platforms for robotic control and establish a pathway toward energy-efficient, real-time neuromorphic computation in future space and terrestrial robotics applications.

NEFeb 12, 2025
Neuromorphic Principles for Efficient Large Language Models on Intel Loihi 2

Steven Abreu, Sumit Bam Shrestha, Rui-Jie Zhu et al.

Large language models (LLMs) deliver impressive performance but require large amounts of energy. In this work, we present a MatMul-free LLM architecture adapted for Intel's neuromorphic processor, Loihi 2. Our approach leverages Loihi 2's support for low-precision, event-driven computation and stateful processing. Our hardware-aware quantized model on GPU demonstrates that a 370M parameter MatMul-free model can be quantized with no accuracy loss. Based on preliminary results, we report up to 3x higher throughput with 2x less energy, compared to transformer-based LLMs on an edge GPU, with significantly better scaling. Further hardware optimizations will increase throughput and decrease energy consumption. These results show the potential of neuromorphic hardware for efficient inference and pave the way for efficient reasoning models capable of generating complex, long-form text rapidly and cost-effectively.

LGFeb 3, 2025
Accelerating Linear Recurrent Neural Networks for the Edge with Unstructured Sparsity

Alessandro Pierro, Steven Abreu, Jonathan Timcheck et al.

Linear recurrent neural networks enable powerful long-range sequence modeling with constant memory usage and time-per-token during inference. These architectures hold promise for streaming applications at the edge, but deployment in resource-constrained environments requires hardware-aware optimizations to minimize latency and energy consumption. Unstructured sparsity offers a compelling solution, enabling substantial reductions in compute and memory requirements--when accelerated by compatible hardware platforms. In this paper, we conduct a scaling study to investigate the Pareto front of performance and efficiency across inference compute budgets. We find that highly sparse linear RNNs consistently achieve better efficiency-performance trade-offs than dense baselines, with 2x less compute and 36% less memory at iso-accuracy. Our models achieve state-of-the-art results on a real-time streaming task for audio denoising. By quantizing our sparse models to fixed-point arithmetic and deploying them on the Intel Loihi 2 neuromorphic chip for real-time processing, we translate model compression into tangible gains of 42x lower latency and 149x lower energy consumption compared to a dense model on an edge GPU. Our findings showcase the transformative potential of unstructured sparsity, paving the way for highly efficient recurrent neural networks in real-world, resource-constrained environments.

CVMar 21, 2025
Region Masking to Accelerate Video Processing on Neuromorphic Hardware

Sreetama Sarkar, Sumit Bam Shrestha, Yue Che et al.

The rapidly growing demand for on-chip edge intelligence on resource-constrained devices has motivated approaches to reduce energy and latency of deep learning models. Spiking neural networks (SNNs) have gained particular interest due to their promise to reduce energy consumption using event-based processing. We assert that while sigma-delta encoding in SNNs can take advantage of the temporal redundancy across video frames, they still involve a significant amount of redundant computations due to processing insignificant events. In this paper, we propose a region masking strategy that identifies regions of interest at the input of the SNN, thereby eliminating computation and data movement for events arising from unimportant regions. Our approach demonstrates that masking regions at the input not only significantly reduces the overall spiking activity of the network, but also provides significant improvement in throughput and latency. We apply region masking during video object detection on Loihi 2, demonstrating that masking approximately 60% of input regions can reduce energy-delay product by 1.65x over a baseline sigma-delta network, with a degradation in mAP@0.5 by 1.09%.

CLJun 4, 2024
Scalable MatMul-free Language Modeling

Rui-Jie Zhu, Yu Zhang, Steven Abreu et al.

Large Language Models (LLMs) have fundamentally altered how we approach scaling in machine learning. However, these models pose substantial computational and memory challenges, primarily due to the reliance on matrix multiplication (MatMul) within their attention and feed-forward (FFN) layers. We demonstrate that MatMul operations can be eliminated from LLMs while maintaining strong performance, even at billion-parameter scales. Our MatMul-free models, tested on models up to 2.7B parameters, are comparable to state-of-the-art pre-trained Transformers, and the performance gap narrows as model size increases. Our approach yields significant memory savings: a GPU-efficient implementation reduces memory consumption by up to 61% during training and over 10x during inference. When adapted for a multi-chip neuromorphic system, the model leverages asynchronous processing to achieve 4x higher throughput with 10x less energy than edge GPUs.

ETNov 5, 2021
Efficient Neuromorphic Signal Processing with Loihi 2

Garrick Orchard, E. Paxon Frady, Daniel Ben Dayan Rubin et al.

The biologically inspired spiking neurons used in neuromorphic computing are nonlinear filters with dynamic state variables -- very different from the stateless neuron models used in deep learning. The next version of Intel's neuromorphic research processor, Loihi 2, supports a wide range of stateful spiking neuron models with fully programmable dynamics. Here we showcase advanced spiking neuron models that can be used to efficiently process streaming data in simulation experiments on emulated Loihi 2 hardware. In one example, Resonate-and-Fire (RF) neurons are used to compute the Short Time Fourier Transform (STFT) with similar computational complexity but 47x less output bandwidth than the conventional STFT. In another example, we describe an algorithm for optical flow estimation using spatiotemporal RF neurons that requires over 90x fewer operations than a conventional DNN-based solution. We also demonstrate promising preliminary results using backpropagation to train RF neurons for audio classification tasks. Finally, we show that a cascade of Hopf resonators - a variant of the RF neuron - replicates novel properties of the cochlea and motivates an efficient spike-based spectrogram encoder.

NEAug 3, 2020
Online Few-shot Gesture Learning on a Neuromorphic Processor

Kenneth Stewart, Garrick Orchard, Sumit Bam Shrestha et al.

We present the Surrogate-gradient Online Error-triggered Learning (SOEL) system for online few-shot learning on neuromorphic processors. The SOEL learning system uses a combination of transfer learning and principles of computational neuroscience and deep learning. We show that partially trained deep Spiking Neural Networks (SNNs) implemented on neuromorphic hardware can rapidly adapt online to new classes of data within a domain. SOEL updates trigger when an error occurs, enabling faster learning with fewer updates. Using gesture recognition as a case study, we show SOEL can be used for online few-shot learning of new classes of pre-recorded gesture data and rapid online learning of new gestures from data streamed live from a Dynamic Active-pixel Vision Sensor to an Intel Loihi neuromorphic research processor.

NEMar 5, 2020
Event-Based Angular Velocity Regression with Spiking Networks

Mathias Gehrig, Sumit Bam Shrestha, Daniel Mouritzen et al.

Spiking Neural Networks (SNNs) are bio-inspired networks that process information conveyed as temporal spikes rather than numeric values. A spiking neuron of an SNN only produces a spike whenever a significant number of spikes occur within a short period of time. Due to their spike-based computational model, SNNs can process output from event-based, asynchronous sensors without any pre-processing at extremely lower power unlike standard artificial neural networks. This is possible due to specialized neuromorphic hardware that implements the highly-parallelizable concept of SNNs in silicon. Yet, SNNs have not enjoyed the same rise of popularity as artificial neural networks. This not only stems from the fact that their input format is rather unconventional but also due to the challenges in training spiking networks. Despite their temporal nature and recent algorithmic advances, they have been mostly evaluated on classification problems. We propose, for the first time, a temporal regression problem of numerical values given events from an event camera. We specifically investigate the prediction of the 3-DOF angular velocity of a rotating event camera with an SNN. The difficulty of this problem arises from the prediction of angular velocities continuously in time directly from irregular, asynchronous event-based input. Directly utilising the output of event cameras without any pre-processing ensures that we inherit all the benefits that they provide over conventional cameras. That is high-temporal resolution, high-dynamic range and no motion blur. To assess the performance of SNNs on this task, we introduce a synthetic event camera dataset generated from real-world panoramic images and show that we can successfully train an SNN to perform angular velocity regression.

NEOct 11, 2019
On-chip Few-shot Learning with Surrogate Gradient Descent on a Neuromorphic Processor

Kenneth Stewart, Garrick Orchard, Sumit Bam Shrestha et al.

Recent work suggests that synaptic plasticity dynamics in biological models of neurons and neuromorphic hardware are compatible with gradient-based learning (Neftci et al., 2019). Gradient-based learning requires iterating several times over a dataset, which is both time-consuming and constrains the training samples to be independently and identically distributed. This is incompatible with learning systems that do not have boundaries between training and inference, such as in neuromorphic hardware. One approach to overcome these constraints is transfer learning, where a portion of the network is pre-trained and mapped into hardware and the remaining portion is trained online. Transfer learning has the advantage that pre-training can be accelerated offline if the task domain is known, and few samples of each class are sufficient for learning the target task at reasonable accuracies. Here, we demonstrate on-line surrogate gradient few-shot learning on Intel's Loihi neuromorphic research processor using features pre-trained with spike-based gradient backpropagation-through-time. Our experimental results show that the Loihi chip can learn gestures online using a small number of shots and achieve results that are comparable to the models simulated on a conventional processor.

NESep 5, 2018
SLAYER: Spike Layer Error Reassignment in Time

Sumit Bam Shrestha, Garrick Orchard

Configuring deep Spiking Neural Networks (SNNs) is an exciting research avenue for low power spike event based computation. However, the spike generation function is non-differentiable and therefore not directly compatible with the standard error backpropagation algorithm. In this paper, we introduce a new general backpropagation mechanism for learning synaptic weights and axonal delays which overcomes the problem of non-differentiability of the spike function and uses a temporal credit assignment policy for backpropagating error to preceding layers. We describe and release a GPU accelerated software implementation of our method which allows training both fully connected and convolutional neural network (CNN) architectures. Using our software, we compare our method against existing SNN based learning approaches and standard ANN to SNN conversion techniques and show that our method achieves state of the art performance for an SNN on the MNIST, NMNIST, DVS Gesture, and TIDIGITS datasets.