Guokai Chen

h-index1
2papers

2 Papers

AROct 2, 2022
RISC-V Toolchain and Agile Development based Open-source Neuromorphic Processor

Jiulong Wang, Ruopu Wu, Guokai Chen et al.

In recent decades, neuromorphic computing aiming to imitate brains' behaviors has been developed in various fields of computer science. The Artificial Neural Network (ANN) is an important concept in Artificial Intelligence (AI). It is utilized in recognition and classification. To explore a better way to simulate obtained brain behaviors, which is fast and energy-efficient, on hardware, researchers need an advanced method such as neuromorphic computing. In this case, Spiking Neural Network (SNN) becomes an optimal choice in hardware implementation. Recent works are focusing on accelerating SNN computing. However, most accelerator solutions are based on CPU-accelerator architecture which is energy-inefficient due to the complex control flows in this structure. This paper proposes Wenquxing 22A, a low-power neuromorphic processor that combines general-purpose CPU functions and SNN to efficiently compute it with RISC-V SNN extension instructions. The main idea of Wenquxing 22A is to integrate the SNN calculation unit into the pipeline of a general-purpose CPU to achieve low-power computing with customized RISC-V SNN instructions version 1.0 (RV-SNN V1.0), Streamlined Leaky Integrate-and-Fire (LIF) model, and the binary stochastic Spike-timing-dependent-plasticity (STDP). The source code of Wenquxing 22A is released online on Gitee and GitHub. We apply Wenquxing 22A to the recognition of the MNIST dataset to make a comparison with other SNN systems. Our experiment results show that Wenquxing 22A improves the energy expenses by 5.13 times over the accelerator solution, ODIN, with approximately classification accuracy, 85.00% for 3-bit ODIN online learning, and 91.91% for 1-bit Wenquxing 22A.

ARJul 15, 2025
SystolicAttention: Fusing FlashAttention within a Single Systolic Array

Jiawei Lin, Guokai Chen, Yuanlong Li et al.

Transformer models rely heavily on scaled dot-product attention (SDPA), typically implemented using the FlashAttention algorithm. However, current systolic-array-based accelerators face significant challenges when executing FlashAttention. Systolic arrays achieve high utilization primarily for consecutive and large matrix multiplications, whereas FlashAttention requires frequent interleaving of matrix multiplications and softmax operations. The frequent data swaps between matrix multiplications on the systolic array and softmax operations on external units result in low array utilization. Moreover, when these computations run concurrently, the softmax stage contends with matrix multiplication for register file and SRAM ports, further degrading performance. To overcome these limitations, we propose FSA, an enhanced systolic array architecture that enables the FlashAttention algorithm to run entirely within a single systolic array, eliminating the need for external vector units. At the core of FSA is SystolicAttention, a novel scheduling algorithm that maps FlashAttention operations onto systolic arrays with fine-grained, element-wise overlap. This approach significantly improves array utilization while preserving the original floating-point operation order to maintain numerical stability. We implement FSA in synthesizable RTL and evaluate its performance against state-of-the-art commercial accelerators. Our results show that FSA achieves 1.77 and 4.83 times higher attention FLOPs/s utilization compared to AWS Neuron v2 and Google TPUv5e, respectively, with only 12% area overhead.