'1'-bit Count-based Sorting Unit to Reduce Link Power in DNN Accelerators
This work addresses power efficiency in DNN accelerators, but it is incremental as it builds on existing data reordering techniques with hardware optimizations.
The paper tackles interconnect power consumption in DNN accelerators by proposing a hardware implementation of a comparison-free sorting unit for CNNs. Its approximate variant achieves up to 35.4% area reduction while maintaining a 19.50% bit transition (BT) reduction, compared to 20.42% for the precise implementation.
Interconnect power consumption remains a bottleneck in Deep Neural Network (DNN) accelerators. While ordering data based on '1'-bit counts can mitigate this by reducing switching activity, practical hardware sorting implementations remain underexplored. This work proposes a hardware implementation of a comparison-free sorting unit optimized for Convolutional Neural Networks (CNNs). By leveraging approximate computing to group population counts into coarse-grained buckets, our design reduces hardware area while preserving the link power benefits of data reordering. Our approximate sorting unit achieves up to 35.4% area reduction while maintaining a 19.50% bit transition (BT) reduction, compared to 20.42% for the precise implementation.
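The idea in the abstract can be illustrated with a small software model. This is a minimal sketch, not the paper's hardware design: the 8-bit word width, the even bucket boundaries, and the names `popcount`, `bucket_sort_by_popcount`, and `bit_transitions` are all assumptions made for illustration. Words are placed into buckets indexed by their '1'-bit count, so no pairwise comparisons are needed, and using fewer buckets than distinct counts models the coarse-grained approximate variant.

```python
from typing import List, Optional


def popcount(x: int) -> int:
    """Number of '1' bits in a non-negative integer."""
    return bin(x).count("1")


def bucket_sort_by_popcount(
    words: List[int], width: int = 8, num_buckets: Optional[int] = None
) -> List[int]:
    """Order words by '1'-bit count without comparing words to each other.

    With num_buckets == width + 1 every count has its own bucket (precise).
    A smaller num_buckets merges neighbouring counts into one bucket
    (approximate). The even split used here is an assumption, not the
    paper's bucket mapping.
    """
    if num_buckets is None:
        num_buckets = width + 1
    buckets: List[List[int]] = [[] for _ in range(num_buckets)]
    for w in words:
        # Map a count in 0..width onto a bucket index in 0..num_buckets-1.
        idx = popcount(w) * num_buckets // (width + 1)
        buckets[idx].append(w)
    # Reading the buckets out in order yields the (approximately) sorted stream.
    return [w for bucket in buckets for w in bucket]


def bit_transitions(seq: List[int]) -> int:
    """Total bits that toggle between consecutive words on a link."""
    return sum(popcount(a ^ b) for a, b in zip(seq, seq[1:]))


if __name__ == "__main__":
    data = [0x03, 0x01, 0xFF, 0x00, 0x0F]
    precise = bucket_sort_by_popcount(data)
    approx = bucket_sort_by_popcount(data, num_buckets=3)
    for name, seq in [("original", data), ("precise", precise), ("approx", approx)]:
        print(name, [hex(w) for w in seq], bit_transitions(seq))
```

On this toy input the coarse three-bucket ordering gives up part of the transition savings of the precise ordering, which mirrors the trade-off the abstract reports. Ordering by '1'-bit count lowers transitions on typical data but does not guarantee a reduction for every input.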