CV · Jul 8, 2022

VidConv: A modernized 2D ConvNet for Efficient Video Recognition

arXiv:2207.03782v1 · 3 citations · h-index: 10 · Has Code
Originality: Incremental advance
AI Analysis

The work provides an efficient solution for industrial deployment on embedded devices such as FPGA boards, though it is incremental because it builds on existing ConvNet redesigns.

The paper tackles the inefficiency of video recognition models such as Vision Transformers by designing a modernized 2D ConvNet backbone for action recognition, achieving results comparable to ViT with 5x-10x fewer training epochs on two benchmark datasets.

Since their introduction in 2020, Vision Transformers (ViTs) have steadily broken records on many vision tasks and are often described as "all-you-need" replacements for ConvNets. Despite that, ViTs are generally computationally expensive, memory-consuming, and unfriendly to embedded devices. In addition, recent research shows that a standard ConvNet, if redesigned and trained appropriately, can compete favorably with ViT in terms of accuracy and scalability. In this paper, we adopt the modernized ConvNet structure to design a new backbone for action recognition. In particular, our main target is industrial product deployment, such as on FPGA boards, where only standard operations are supported. Therefore, our network consists simply of 2D convolutions, without any 3D convolution, long-range attention plugin, or Transformer block. While being trained with far fewer epochs (5x-10x fewer), our backbone surpasses methods using (2+1)D and 3D convolution and achieves results comparable to ViT on two benchmark datasets.
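To make the contrast between 2D, (2+1)D, and 3D convolutions concrete, the sketch below counts the weights in a single layer of each kind using the standard formulas. The channel widths, kernel sizes, and the intermediate width `c_mid` are illustrative assumptions and are not taken from the paper; the helper names are hypothetical.

```python
# Weight counts for one convolutional layer (bias terms omitted).
# All sizes below are illustrative assumptions, not values from the paper.

def conv2d_params(c_in, c_out, k):
    """Weights in a k x k 2D convolution."""
    return c_in * c_out * k * k

def conv3d_params(c_in, c_out, k, k_t):
    """Weights in a k_t x k x k 3D convolution."""
    return c_in * c_out * k_t * k * k

def conv2plus1d_params(c_in, c_out, k, k_t, c_mid):
    """Weights in a (2+1)D block: a 1 x k x k spatial convolution into
    c_mid channels, followed by a k_t x 1 x 1 temporal convolution."""
    return c_in * c_mid * k * k + c_mid * c_out * k_t

c_in, c_out = 64, 64   # assumed channel widths
k, k_t = 3, 3          # assumed spatial and temporal kernel sizes
c_mid = 64             # assumed intermediate width for the (2+1)D block

p2d = conv2d_params(c_in, c_out, k)
p3d = conv3d_params(c_in, c_out, k, k_t)
p21d = conv2plus1d_params(c_in, c_out, k, k_t, c_mid)

print(f"2D:     {p2d}")
print(f"(2+1)D: {p21d}")
print(f"3D:     {p3d}")
```

Under these assumed sizes, the 3D layer holds `k_t` times as many weights as the 2D layer, and the (2+1)D block falls in between. This covers only the parameter count of a single layer; it says nothing about accuracy or about how the paper's backbone handles temporal information.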

Code Implementations: 1 repo