34.2ARJun 2
SpecPCM: A Low-power PCM-based In-Memory Computing Accelerator for Full-stack Mass Spectrometry AnalysisKeming Fan, Ashkan Moradifirouzabadi, Xiangjin Wu et al.
Mass spectrometry (MS) is essential for proteomics and metabolomics but faces impending challenges in efficiently processing the vast volumes of data. This paper introduces SpecPCM, an in-memory computing (IMC) accelerator designed to achieve substantial improvements in energy and delay efficiency for both MS spectral clustering and database (DB) search. SpecPCM employs analog processing with low-voltage swing and utilizes recently introduced phase change memory (PCM) devices based on superlattice materials, optimized for low-voltage and low-power programming. Our approach integrates contributions across multiple levels: application, algorithm, circuit, device, and instruction sets. We leverage a robust hyperdimensional computing (HD) algorithm with a novel dimension-packing method and develop specialized hardware for the end-to-end MS pipeline to overcome the non-ideal behavior of PCM devices. We further optimize multi-level PCM devices for different tasks by using different materials. We also perform a comprehensive design exploration to improve energy and delay efficiency while maintaining accuracy, exploring various combinations of hardware and software parameters controlled by the instruction set architecture (ISA). SpecPCM, with up to three bits per cell, achieves speedups of up to 82x and 143x for MS clustering and DB search tasks, respectively, along with a four-orders-of-magnitude improvement in energy efficiency compared with state-of-the-art CPU/GPU tools.
56.9ARApr 13
GEN-Graph: Heterogeneous PIM Accelerator for General Computational Patterns in Graph-based Dynamic ProgrammingYanru Chen, Runyang Tian, Zheyu Li et al.
While graph-based dynamic programming (DP) is a cornerstone of genomics and network analytics, its efficiency is hampered by fundamentally conflicting computational patterns. Matrix-centric DP drives regular, compute-bound network analytics, while topology-centric DP handles irregular, memory-bound genomic traversals. These two categories of DP have substantially different computation patterns and dataflows, which makes it difficult for a single homogeneous processing-in-memory (PIM) architecture to efficiently support both. This work presents GEN-Graph, a novel heterogeneous PIM chiplet that integrates two types of specialized compute tiles within a 2.5D package: Matrix-tile, a processing-using-memory (PUM) tile optimized for matrix-centric workloads, such as all-pairs shortest path (APSP); and traversal-tile, a processing-near-memory (PNM) tile optimized for traversal-centric DP workloads, such as DNA sequence alignment. Our hardware-software co-design employs recursive partitioning and reconfigurable windowed bit-parallel logic to ensure exact computation. Results show the matrix tile achieves 42.8x speedup and 392x energy efficiency over the NVIDIA H100 GPU for APSP. For sequence-to-graph alignment, the traversal tile sustains 2.56 million reads/s (short-reads) and 39.3 thousand reads/s (long-reads), outperforming state-of-the-art accelerators by up to 2.56x in throughput. GEN-Graph provides the first scalable, exact solution for general DP dataflows by matching hardware specialization to algorithmic structure.
ARDec 12, 2025
CHIME: Chiplet-based Heterogeneous Near-Memory Acceleration for Edge Multimodal LLM InferenceYanru Chen, Runyang Tian, Yue Pan et al.
The proliferation of large language models (LLMs) is accelerating the integration of multimodal assistants into edge devices, where inference is executed under stringent latency and energy constraints, often exacerbated by intermittent connectivity. These challenges become particularly acute in the context of multimodal LLMs (MLLMs), as high-dimensional visual inputs are transformed into extensive token sequences, thereby inflating the key-value (KV) cache and imposing substantial data movement overheads to the LLM backbone. To address these issues, we present CHIME, a chiplet-based heterogeneous near-memory acceleration for edge MLLMs inference. CHIME leverages the complementary strengths of integrated monolithic 3D (M3D) DRAM and RRAM chiplets: DRAM supplies low-latency bandwidth for attention, while RRAM offers dense, non-volatile storage for weights. This heterogeneous hardware is orchestrated by a co-designed mapping framework that executes fused kernels near data, minimizing cross-chiplet traffic to maximize effective bandwidth. On FastVLM (0.6B/1.7B) and MobileVLM (1.7B/3B), CHIME achieves up to 54x speedup and up to 246x better energy efficiency per inference as compared to the edge GPU NVIDIA Jetson Orin NX. It sustains 116.5-266.5 token/J compared to Jetson's 0.7-1.1 token/J. Furthermore, it delivers up to 69.2x higher throughput than the state-of-the-art PIM accelerator FACIL. Compared to the M3D DRAM-only design, CHIME's heterogeneous memory further improves energy efficiency by 7% and performance by 2.4x.