LG | Apr 1, 2025

NeuraLUT-Assemble: Hardware-aware Assembling of Sub-Neural Networks for Efficient LUT Inference

Marta Andronic, George A. Constantinides

arXiv:2504.00592v118.814 citations | h-index: 7 | FCCM

Originality: Highly original

AI Analysis

This work addresses efficiency and accuracy challenges in deploying neural networks on FPGAs for edge-computing applications such as particle physics. It represents a strong, specific gain rather than a broad paradigm shift.

The paper tackles the accuracy degradation in lookup table (LUT)-based neural networks, which stems from the exponential scaling of LUT resources with input width. It proposes NeuraLUT-Assemble, a framework that assembles larger neurons from smaller units and adds mixed precision and skip-connections, achieving competitive accuracy and up to an 8.42× reduction in area-delay product compared to the state of the art.

Efficient neural networks (NNs) leveraging lookup tables (LUTs) have demonstrated significant potential for emerging AI applications, particularly when deployed on field-programmable gate arrays (FPGAs) for edge computing. These architectures promise ultra-low latency and reduced resource utilization, broadening neural network adoption in fields such as particle physics. However, existing LUT-based designs suffer from accuracy degradation because the large fan-in that neurons require is limited by the exponential scaling of LUT resources with input width. In prior work, this tension has resulted in reliance on extremely sparse models. We present NeuraLUT-Assemble, a novel framework that addresses these limitations by combining mixed-precision techniques with the assembly of larger neurons from smaller units, thereby increasing connectivity while keeping the number of inputs of any given LUT manageable. Additionally, we introduce skip-connections across entire LUT structures to improve gradient flow. NeuraLUT-Assemble closes the accuracy gap between LUT-based methods and (fully-connected) MLP-based models, achieving competitive accuracy on tasks such as network intrusion detection, digit classification, and jet classification, and demonstrating up to an $8.42\times$ reduction in the area-delay product compared to the state of the art at the time of publication.
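To make the exponential-scaling argument concrete, the following is a minimal sketch of the arithmetic, not the paper's cost model or implementation. It assumes a neuron with `fan_in` inputs of `bits_per_input` bits each is stored as a single truth table with 2^(fan_in × bits_per_input) entries, and compares that against a tree of smaller LUT units covering the same inputs. The function names, the uniform bit width at every tree level, and the use of table entries as a proxy for area are all illustrative assumptions; mixed precision and skip-connections are not modeled.

```python
def lut_entries(fan_in: int, bits_per_input: int) -> int:
    """Truth-table entries for one neuron realised as a single LUT.

    Every combination of input bits needs its own entry, so the table
    size grows exponentially with the total input width.
    """
    return 2 ** (fan_in * bits_per_input)


def tree_entries(total_fan_in: int, unit_fan_in: int, bits_per_input: int) -> int:
    """Upper bound on entries for a tree of small LUT units (hypothetical model).

    Assumption: every unit takes up to `unit_fan_in` inputs and emits one
    value of `bits_per_input` bits, which feeds the next tree level.
    Partially filled units are counted as full, hence an upper bound.
    """
    if unit_fan_in < 2:
        raise ValueError("unit_fan_in must be at least 2 for the tree to shrink")
    total = 0
    width = total_fan_in
    while width > 1:
        units = -(-width // unit_fan_in)  # ceiling division
        total += units * lut_entries(unit_fan_in, bits_per_input)
        width = units  # each unit contributes one input to the next level
    return total


if __name__ == "__main__":
    monolithic = lut_entries(fan_in=16, bits_per_input=2)
    assembled = tree_entries(total_fan_in=16, unit_fan_in=4, bits_per_input=2)
    print(f"single LUT, fan-in 16: {monolithic} entries")
    print(f"tree of fan-in-4 units: {assembled} entries")
```

Under these assumptions, a 16-input neuron with 2-bit inputs needs 2^32 table entries as a single LUT, whereas a two-level tree of 4-input units needs 5 tables of 256 entries each. The tree cannot represent every function of 16 inputs, which is why the choice of what each unit computes is learned during training.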
