AR · Apr 5

Ten-Four: An Open-Source Fused Dot Product Unit for Mixed-Precision GPGPU Tensor Cores

arXiv:2512.00053 · 5.2 · h-index: 4 · Has Code
Predicted impact: top 94% in AR · last 90 days · Originality: Incremental advance
AI Analysis

This paper addresses performance and resource-utilization issues in open-source GPGPU hardware for deep learning acceleration, though it appears incremental because it builds on existing Tensor Core concepts.

The paper tackles inefficiencies in open-source dot product implementations for GPGPU Tensor Cores by proposing Ten-Four, a fused mixed-precision unit that integrates the floating-point and integer pipelines. It reports 4-cycle latency at 262.325 MHz and a ~3.1x performance improvement over an equivalent Berkeley HardFloat-based baseline at less than 60% of the area cost.

Efficient mixed-precision matrix multiply accumulate (MMA) operations are critical for accelerating deep learning workloads on GPGPUs. However, existing open-source dot product implementations for Tensor Cores rely on discrete arithmetic units, leading to high latency, accumulated rounding errors, and poor resource utilization. To address these challenges, we propose Ten-Four, a scalable mixed-precision fused dot product unit that integrates both the floating-point and integer arithmetic pipelines within a single fused architecture, implemented as part of the open-source RISC-V-based Vortex GPGPU's Tensor Core Unit extension. Our design supports low-precision multiplication in FP16/BF16/FP8/BF8/INT8/INT4 formats and higher-precision accumulation in FP32/INT32, with native support for Microscaling (MX) and sparse lane clock-gating for dynamic power reduction, while matching the numerical accuracy of NVIDIA Tensor Cores. Ten-Four achieves 4-cycle operation latency at 262.325 MHz Fmax, delivering 134.308 GFLOPS peak throughput per Tensor Core on the AMD Xilinx Alveo U55C FPGA, demonstrating ~3.1x performance improvement over an equivalent Berkeley HardFloat-based implementation at less than 60% of the area cost.
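
The abstract's point about accumulated rounding error in discrete arithmetic units can be illustrated with a small software model. The sketch below is not the Ten-Four datapath; it is a simplified assumption-laden comparison in which a "discrete" dot product rounds to FP32 after every multiply and every add, while a "fused" dot product accumulates exactly and rounds once at the end. The function names (`round_f32`, `discrete_dot`, `fused_dot`) are hypothetical and chosen only for this illustration.

```python
import struct
from fractions import Fraction


def round_f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def discrete_dot(a, b):
    """Model of discrete units: round to FP32 after each multiply and each add."""
    acc = 0.0
    for x, y in zip(a, b):
        prod = round_f32(x * y)      # multiplier output rounded to FP32
        acc = round_f32(acc + prod)  # adder output rounded to FP32
    return acc


def fused_dot(a, b):
    """Model of a fused unit: exact accumulation, one final rounding.

    Simplification: the exact sum is converted to binary64 before the
    binary32 rounding, which is harmless for the small example below.
    """
    exact = sum(Fraction(x) * Fraction(y) for x, y in zip(a, b))
    return round_f32(float(exact))


# All inputs are exactly representable in FP16.
a = [1.0, 2.0**-12, 2.0**-12, 2.0**-12]
b = [1.0, 2.0**-13, 2.0**-13, 2.0**-13]

print(discrete_dot(a, b))  # each 2**-25 term is lost against 1.0
print(fused_dot(a, b))     # the three small terms survive as one FP32 ulp
```

In this example each small product (2^-25) is below half an FP32 ulp of 1.0, so the discrete model drops all three, whereas the single final rounding in the fused model keeps their combined contribution.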
