CORDIC Is All You Need
This work addresses efficient hardware acceleration for AI workloads such as Transformers and DNNs, particularly for edge AI applications. It appears incremental, since it builds on existing systolic array and CORDIC methods.
The paper tackles the need for adaptable hardware accelerators in AI by presenting a pipelined architecture with a CORDIC block for linear and nonlinear computations, achieving up to 4.64x higher throughput and power and area reductions of 5.02x and 4.06x in 28 nm CMOS, with minor accuracy loss.
Artificial intelligence workloads require adaptable hardware accelerators that sustain high throughput across millions of operations. We present a pipelined architecture with a CORDIC block that performs both linear MAC computations and nonlinear, iterative activation functions (AFs) such as $\tanh$, sigmoid, and softmax. The approach centers on a systolic array built from Reconfigurable Processing Engines (RPEs); with a 40\% pruning rate, it improves throughput by up to 4.64$\times$ and reduces power and area by 5.02$\times$ and 4.06$\times$ in 28 nm CMOS, with minor accuracy loss. The FPGA implementation achieves up to 2.5$\times$ resource savings and 3$\times$ lower power compared to prior works. The Systolic CORDIC engine for Reconfigurability and Enhanced throughput (SYCore) deploys an output-stationary dataflow with the CAESAR control engine for diverse AI workloads such as Transformers, RNNs/LSTMs, and DNNs, targeting applications like image detection, LLMs, and speech recognition. This energy-efficient and flexible approach extends to edge AI accelerators supporting emerging workloads.
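
To make the shared-datapath idea concrete, the following is a minimal floating-point sketch of how one shift-and-add CORDIC iteration can serve both a MAC (linear mode) and an activation function (hyperbolic mode), and how such a MAC fits an output-stationary accumulation. It is an illustration under stated assumptions, not the RPE or SYCore design: all function names are hypothetical, the hardware would presumably use fixed-point arithmetic and precomputed constants, and the operand ranges noted in the comments are the textbook CORDIC convergence limits rather than figures from this work.

```python
import math

def hyperbolic_schedule(n):
    """Shift indices 1, 2, 3, 4, 4, 5, ...; indices 4, 13, 40 repeat for convergence."""
    seq, i, next_repeat = [], 1, 4
    while len(seq) < n:
        seq.append(i)
        if i == next_repeat and len(seq) < n:
            seq.append(i)                      # repeated iteration
            next_repeat = 3 * next_repeat + 1
        i += 1
    return seq

def cordic_mac(x, z, acc=0.0, n=16):
    """Linear rotation mode: approximates acc + x*z for |z| <= 2."""
    y = acc
    for i in range(n):
        d = 1.0 if z >= 0 else -1.0            # drive residual z toward 0
        y += d * x * 2.0 ** -i                 # a shift and add in hardware
        z -= d * 2.0 ** -i
    return y

def cordic_divide(y, x, n=16):
    """Linear vectoring mode: approximates y/x for x > 0 and |y/x| <= 2."""
    z = 0.0
    for i in range(n):
        d = 1.0 if y >= 0 else -1.0            # drive residual y toward 0
        y -= d * x * 2.0 ** -i
        z += d * 2.0 ** -i
    return z

def cordic_sinh_cosh(theta, n=16):
    """Hyperbolic rotation mode: (sinh, cosh) for |theta| up to about 1.118."""
    sched = hyperbolic_schedule(n)
    # Gain compensation; a constant for fixed n (assumed precomputed in hardware).
    gain = math.prod(math.sqrt(1.0 - 2.0 ** (-2 * i)) for i in sched)
    x, y, z = 1.0 / gain, 0.0, theta
    for i in sched:
        d = 1.0 if z >= 0 else -1.0
        x, y = x + d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * math.atanh(2.0 ** -i)         # lookup-table entry in hardware
    return y, x

def cordic_tanh(theta, n=16):
    s, c = cordic_sinh_cosh(theta, n)
    return cordic_divide(s, c, n)              # tanh = sinh / cosh

def cordic_sigmoid(v, n=16):
    # Identity sigmoid(v) = (1 + tanh(v/2)) / 2; valid here for |v| up to about 2.2.
    return 0.5 * (1.0 + cordic_tanh(0.5 * v, n))

def output_stationary_matmul(A, B, n=16):
    """Each C[i][j] stays in its own accumulator while operands stream past.
    Assumes every entry of B lies in [-2, 2] (the linear-mode range)."""
    rows, inner, cols = len(A), len(B), len(B[0])
    C = [[0.0] * cols for _ in range(rows)]
    for k in range(inner):                     # one streaming step per k
        for i in range(rows):
            for j in range(cols):
                C[i][j] = cordic_mac(A[i][k], B[k][j], C[i][j], n)
    return C
```

For example, `cordic_mac(0.75, -1.2, acc=0.5)` approximates `0.5 + 0.75 * (-1.2)`, and `cordic_tanh(0.8)` approximates `math.tanh(0.8)`; with `n = 16` iterations the residual error is on the order of $2^{-15}$. Inputs outside the convergence ranges would need range reduction or scaling, which this sketch omits.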