ARMay 21
Emerging memory technologies at room/cryogenic temperatureSiddhartha Raman Sundara Raman
As conventional technology scaling approaches physical and power limitations, modern computing systems increasingly face performance bottlenecks arising from memory latency, energy consumption, scalability constraints, and data movement overheads. Simultaneously, emerging workloads such as machine learning, graph analytics, and scientific computing demand memory technologies with higher bandwidth, lower latency, improved energy efficiency, and greater storage density. These challenges have motivated extensive research into both room-temperature memories and cryogenic memory systems targeted toward superconducting and quantum computing platforms. This chapter presents an overview of volatile and non-volatile memory technologies operating across room-temperature and cryogenic environments. The discussion includes SRAM, DRAM, embedded DRAM (eDRAM), NAND/NOR Flash, Resistive Random Access Memory (RRAM), Magneto-resistive Random Access Memory (MRAM), and Ferroelectric Field-Effect Transistor (FeFET)-based memories. In addition, cryogenic technologies including UTBB-SOI-based pseudo-static storage circuits and Josephson Junction Field-Effect Transistor (JJFET)-based devices are discussed in the context of ultra-low-temperature computing systems. The chapter highlights the operational principles, read/write mechanisms, retention behavior, and tradeoffs among area, performance, scalability, and energy efficiency across these memory technologies, while examining challenges and opportunities for future room-temperature and cryogenic computing architectures.
ARMay 19
A complete discussion on fully reconfigurable, digital, scalable, graph and sparsity-aware near-memory accelerator for graph neural networksSiddhartha Raman Sundara Raman, Lizy John, Jaydeep P. Kulkarni
Graph neural networks (GNNs) have gained significant interest for applications such as citation network analysis and drug discovery due to their ability to apply machine learning techniques on graph-structured data. GNNs typically employ a two-stage execution pipeline consisting of combination and aggregation kernels. The combination stage performs data-intensive convolution operations with relatively regular memory access patterns, whereas the aggregation stage operates on sparse graph data with highly irregular accesses. These heterogeneous memory behaviors make conventional CPU- and GPU-based execution energy inefficient due to substantial data movement overheads. Existing accelerators attempt to mitigate these challenges using specialized architectures and processing-in-memory (PIM) techniques. However, prior approaches often suffer from scalability limitations, area overheads, restricted parallelism, and energy inefficiencies associated with analog compute and dedicated accelerator structures. This paper presents NEM-GNN, a scalable DAC/ADC-less processing-in-memory architecture for graph neural network acceleration. The proposed design introduces early compute termination mechanisms, pre-computation using reconfigurable system-on-chip components, and graph- and sparsity-aware near-memory aggregation using a compute-as-soon-as-ready (CAR) and broadcast-based execution model. Experimental results demonstrate that NEM-GNN achieves approximately 80--230x higher performance, 80--300x higher throughput, 850--1134x better energy efficiency, and 7--8x higher compute density compared to prior state-of-the-art approaches.
ARMay 16
A comprehensive study on ILP acceleration accounting for sparsity, area, energy, data movement using near-memory architectureSiddhartha Raman Sundara Raman, Lizy K John, Jaydeep P. Kulkarni
Integer Linear Programming (ILP) is widely used for solving real-world optimization problems, including network routing, map routing, and traffic scheduling. However, ILP algorithms are sparse and branch-intensive, making them inefficient on conventional CPUs and GPUs. Prior work has shown that large-scale ILP problems can require tens of hours of execution time even on massively parallel systems, limiting their applicability to time-sensitive decision-making workloads. Existing ILP solvers such as Gurobi employ software-level optimizations to handle sparsity on CPUs, but still face throughput limitations. GPU-based ILP solvers are also constrained because GPUs are not well suited for sparse and branch-heavy workloads, leading to thread divergence, under-utilization of streaming multiprocessors, and frequent host-device interactions. This paper presents SPARK, a sparsity-aware, reuse-aware, energy-efficient, reconfigurable near-cache ILP accelerator. SPARK repurposes the existing L1 cache in CPUs to provide near-cache acceleration with minimal hardware overhead of approximately 1.4\% of the CPU area. The architecture performs near-cache sparsity detection and sparsity-aware computation to reduce insignificant computations and data movement energy. SPARK also exploits computational reuse patterns in ILP algorithms to improve parallelism and efficiency. The proposed design supports both sparse and dense ILPs as well as Linear Programs (LPs). Evaluations on real-world workloads from MIPLIB 2017 show that SPARK achieves up to 15x and 20x performance improvement, and up to 152x and 740x energy reduction compared to AMD Zen3 CPUs and NVIDIA Tesla V100 GPUs, respectively, for sparse ILPs. For sparse LPs, SPARK achieves 7-17x performance improvement and 103-250x energy reduction over CPU and GPU baselines, demonstrating the broad applicability of the proposed architecture.
ARMay 13
A detailed algorithmic study on a reuse-aware, near memory, all-digital Ising machineSiddhartha Raman Sundara Raman, Lizy K. John, Jaydeep P. Kulkarni
Recently, nature-inspired computing approaches have gained significant attention for solving difficult optimization problems, particularly through Ising machines for NP-complete applications. Existing Ising accelerators range from quantum and optical annealers to CMOS-based von-Neumann and in-memory architectures. However, many prior designs are specialized accelerators limited to specific problem classes, rely on ADC/DAC circuits, and suffer from reliability challenges due to process-variation-sensitive embedded memory technologies. This paper presents SACHI, an all-digital Ising architecture implemented by repurposing the L1 cache of a CPU using SRAM-based processing-in-memory techniques. SACHI eliminates the need for ADCs/DACs, improves reliability compared to prior approaches such as BRIM, and enables Ising acceleration with minimal hardware overhead integrated into the CPU pipeline. The paper also provides detailed architectural analysis and pseudo-code for the proposed algorithms. The key contributions of SACHI are: (i) tight integration of the accelerator with the CPU pipeline, (ii) reuse of existing cache hardware for acceleration, (iii) higher parallelism enabled through reuse-aware computation, and (iv) improved performance and energy efficiency for large-scale, high-precision optimization problems using novel compute and mapping strategies. Compared to BRIM, SACHI achieves 300x performance improvement and 80x energy reduction across applications including asset allocation, molecular dynamics, image segmentation, and traveling salesman problems. Additionally, reuse factors up to 4000x are observed for several workloads. This work demonstrates that reliable and efficient all-digital Ising acceleration can be achieved using commodity SRAM structures tightly integrated with general-purpose processors.
ARApr 4
ABI: A tightly integrated, unified, sparsity-aware, reconfigurable, compute near-register file/cache GPU architecture with light-weight softmax for deep learning, linear algebra, and Ising computeSiddhartha Raman Sundara Raman, Jaydeep P. Kulkarni
We present a tightly integrated and unified near-memory GPU architecture that delivers 6 to 16 times speedup and 6 to 13 times energy savings across Convolutional Neural Networks, Graph Convolutional Networks, Linear Programming, Large Language Models, and Ising workloads compared to MIAOW GPU. The design includes a custom sparsity-aware near-memory circuit providing about 1.5 times energy savings, and a lightweight softmax circuit providing about 1.6 times energy savings. The architecture supports reconfigurable compute up to INT16 with dynamic resolution updates and scales efficiently across problem sizes. ABI-enabled MI300 and Blackwell systems achieve about 4.5 times speedup over baseline MI300 and Blackwell.
ARApr 6
A comparative study on power delivery aspects of compute-in/near-memory approaches using DRAMSiddhartha Raman Sundara Raman, Siyuan Ma, Lizy Kurian John
Compute-in-memory (PIM) mitigates the memory wall by performing computation within memory, reducing data movement and improving energy efficiency. DRAM-based PIM is particularly attractive due to its high density, mature manufacturing ecosystem, and compatibility with existing systems. Recent works exploit multiple levels of the DRAM hierarchy - including subarrays, banks, and 3D-stacked organizations - to enable in-memory computation using mechanisms such as multi-row activation, row-buffer operations, and near-bank compute units. However, these approaches introduce non-traditional current demand patterns that challenge the power delivery network (PDN). This paper surveys PDN challenges in DRAM-based PIM systems and proposes a unified taxonomy that characterizes PIM-induced current behavior along temporal (burst vs. sustained) and spatial (localized vs. distributed) dimensions. Using this framework, we analyze how representative PIM techniques stress the PDN through bursty activations, multi-row concurrency, and large-scale parallel execution, leading to voltage droop, IR drop, and thermal hotspots. We further discuss DRAM-specific mitigation strategies leveraging existing architectural and circuit-level mechanisms, including timing constraints, memory controller scheduling, data placement, and bank- and vault-level power management. This survey highlights the importance of PDN-aware design for scalable and reliable DRAM-based PIM systems and outlines key future research directions.