FMM-X3D: FPGA-based modeling and mapping of X3D for Human Action Recognition
The work enables efficient deployment of complex 3D CNNs for Human Action Recognition on FPGAs, addressing the computational and memory bottlenecks that limit applications such as surveillance and autonomous vehicles. The contribution is incremental, however, in that it applies existing methods to new hardware.
The paper tackles the challenge of deploying X3D, a Human Action Recognition model that achieves state-of-the-art accuracy of 95.5% on UCF101, on resource-constrained systems. It develops an FPGA-based toolflow that optimises the hardware mapping and improves the performance-accuracy trade-off.
3D Convolutional Neural Networks are gaining increasing attention from researchers and practitioners and have found applications in many domains, such as surveillance systems, autonomous vehicles, human monitoring systems, and video retrieval. However, their widespread adoption is hindered by their high computational and memory requirements, especially when resource-constrained systems are targeted. This paper addresses the problem of mapping X3D, a state-of-the-art model in Human Action Recognition that achieves an accuracy of 95.5% on the UCF101 benchmark, onto any FPGA device. The proposed toolflow generates an optimised stream-based hardware system, taking into account the available resources and off-chip memory characteristics of the FPGA device. The generated designs advance the current performance-accuracy Pareto front and, for the first time, enable such complex model architectures to be targeted for the Human Action Recognition task.
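
To make the remark about computational and memory requirements concrete, the following minimal sketch counts the multiply-accumulate operations (MACs) and weight storage of a single 3D convolution and compares them with a 2D convolution of the same spatial size. The layer dimensions are illustrative assumptions, not taken from X3D or from the paper, and `conv3d_cost` is a hypothetical helper rather than part of the proposed toolflow.

```python
def conv3d_cost(c_in, c_out, kernel, out_shape, groups=1, bytes_per_weight=2):
    """Return (macs, weight_bytes) for one 3D convolution layer, bias ignored.

    kernel    -- (k_depth, k_height, k_width)
    out_shape -- (out_depth, out_height, out_width) of the output feature map
    bytes_per_weight=2 assumes 16-bit weights (an assumption for illustration).
    """
    kd, kh, kw = kernel
    d, h, w = out_shape
    # Each output channel has one filter of size (c_in / groups) x kd x kh x kw.
    weights = c_out * (c_in // groups) * kd * kh * kw
    # Every output position of every output channel applies its filter once.
    macs = weights * d * h * w
    return macs, weights * bytes_per_weight


# Hypothetical layer: 64 -> 64 channels, 3x3x3 kernel, 16 frames of 56x56.
macs_3d, bytes_3d = conv3d_cost(64, 64, (3, 3, 3), (16, 56, 56))
# Same layer restricted to a single frame with a 3x3 kernel (2D equivalent).
macs_2d, bytes_2d = conv3d_cost(64, 64, (1, 3, 3), (1, 56, 56))

print(f"3D conv: {macs_3d:,} MACs, {bytes_3d:,} weight bytes")
print(f"2D conv: {macs_2d:,} MACs, {bytes_2d:,} weight bytes")
print(f"MAC ratio 3D/2D: {macs_3d // macs_2d}")
```

Under these assumed dimensions the temporal axis multiplies the work by the kernel depth and the number of output frames, which is the kind of growth that makes resource-aware mapping necessary on constrained devices.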