Procrastination Is All You Need: Exponent Indexed Accumulators for Floating Point, Posits and Logarithmic Numbers
This paper addresses a bottleneck for hardware designers in numerical computing, the summation of long sequences of numbers, and offers incremental efficiency improvements for specific data types: floating point, posits and logarithmic numbers.
The paper tackles the problem of efficiently summing long sequences of floating-point numbers by introducing a method with two phases, accumulation and reconstruction. Reported FPGA results include a tensor core that multiplies and accumulates two 4x4 matrices of bfloat16 values every clock cycle using ~6,400 LUTs + 64 DSP48 in AMD FPGAs at 700+ MHz. The approach is then extended to posits and logarithmic numbers.
This paper discusses a simple and effective method for the summation of long sequences of floating-point numbers. The method comprises two phases: an accumulation phase, where the mantissas of the floating-point numbers are added to accumulators indexed by their exponents, and a reconstruction phase, where the actual summation result is finalised. Various architectural details are given for both FPGAs and ASICs, including fusing the operation with a multiplier to create efficient MACs. Some results are presented for FPGAs, including a tensor core capable of multiplying and accumulating two 4x4 matrices of bfloat16 values every clock cycle using ~6,400 LUTs + 64 DSP48 in AMD FPGAs at 700+ MHz. The method is then extended to posits and logarithmic numbers.
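The two phases described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the paper's hardware design: it uses Python float64 values rather than bfloat16, unbounded Python integers in place of fixed-width accumulators, and the function names `accumulate` and `reconstruct` are hypothetical. Inputs are assumed to be finite.

```python
import math
from collections import defaultdict
from fractions import Fraction

MANT_BITS = 53  # significand width of float64 (assumption: inputs are Python floats)


def accumulate(values):
    """Accumulation phase: add each signed integer mantissa to the
    accumulator indexed by that value's exponent. No shifting or
    rounding happens here. Assumes finite inputs (no inf/NaN)."""
    acc = defaultdict(int)
    for x in values:
        m, e = math.frexp(x)  # x == m * 2**e with 0.5 <= |m| < 1 (or m == 0)
        # Scaling by 2**MANT_BITS turns the mantissa into an exact integer.
        acc[e] += int(math.ldexp(m, MANT_BITS))
    return acc


def reconstruct(acc):
    """Reconstruction phase: align the per-exponent sums once, add them
    exactly, and round a single time when converting back to a float."""
    if not acc:
        return 0.0
    e_min = min(acc)
    # Shift every accumulator to the scale of the smallest exponent.
    total = sum(m << (e - e_min) for e, m in acc.items())
    # Exact rational value of the sum; float() performs the only rounding.
    return float(Fraction(total) * Fraction(2) ** (e_min - MANT_BITS))


if __name__ == "__main__":
    data = [1e16, 1.0, -1e16]
    print(reconstruct(accumulate(data)))
```

For the input `[1e16, 1.0, -1e16]`, plain left-to-right addition `(1e16 + 1.0) - 1e16` loses the `1.0` because it is rounded away in the first addition, whereas the sketch keeps it: the mantissas of `1e16` and `-1e16` cancel in their own accumulator, and the mantissa of `1.0` sits untouched in another. The design point this illustrates is that the costly alignment and rounding are deferred to a single reconstruction step instead of being performed on every addition.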