Simone Manoni

11.2ARApr 30

AME-PIM: Can Memory be Your Next Tensor Accelerator?

Emanuele Venieri, Simone Manoni, Alberto Florian et al.

High Bandwidth Memory with Processing-in-Memory (HBM-PIM) offers an opportunity to reduce data movement by executing computation directly inside memory, but current commercial platforms expose limited instruction sets and require specialized software stacks. In this work, we investigate whether HBM-PIM can serve as a backend for ISA-level matrix acceleration, using the RISC-V Attached Matrix Extension (AME) as a semantic reference. We propose a PEP-based execution model that maps AME element-wise and matrix instructions to HBM-PIM micro-kernels and data instructions in memory operations. Differently from SoA HBM-PIM, we introduce a reduction-free outer-product dataflow that enables accumulation entirely within memory despite the lack of native reduction support. Our approach supports end-to-end execution of element-wise operations, GEMV, and GEMM in PIM mode, minimizing host involvement and off-chip transfers. An experimental evaluation on Samsung Aquabolt-XL shows that AME matrix tile multiplication achieves up to 14.9 GFLOP/s (59.4 FLOP/cycle) on a single HBM pseudo-channel.

61.5DCApr 22

Monte Cimone v3: Where RISC-V Stands in High-Performance Computing

Emanuele Venieri, Simone Manoni, Giacomo Madella et al.

The Monte Cimone project provides a RISC-V testbed for High-Performacne Computing cluster. This paper presents Monte Cimone v3 (MCv3), the third iteration of the Monte Cimone RISC-V HPC cluster, integrating the SOPHGO Sophon SG2044 processor, an evolution of the SG2042 used in MCv2. We characterize MCv3 using HPL and STREAM benchmarks coupled with power measurements, and compare it against two reference platforms: the Intel Xeon Platinum 8480+(Sapphire Rapids) and the NVIDIA Grace CPU Superchip. Our results show that the SG2044 more than doubles single-core performance and improves scalability compared to SG2042. MCv3 achieves an energy efficiency of 3.08GFLOPs/W which improves of 10x w.r.t. MCv1 and is in the range of x86-64 and Arm servers. On pure performance when normalized on the SIMD/Vector length MCv3 on its peak efficiency point (16 cores) achieves 46% performance of Intel Sapphire Rapids server and 91% performance of NVIDIA Grace CPU superchip.

Simone Manoni

2 Papers