Learnable Sampling 3D Convolution for Video Enhancement and Action Recognition
This work addresses robust fusion of temporal information in video processing for tasks such as enhancement and action recognition, and may benefit researchers and practitioners working with video data.
This paper proposes learnable sampling 3D convolution (LS3D-Conv) to fuse multi-level features across neighboring frames for video enhancement and action recognition. It adds learnable 2D offsets to 3D convolution, allowing flexible sampling of spatial feature map locations across frames, and demonstrates effectiveness across multiple video tasks.
A key challenge in video enhancement and action recognition is to fuse useful information from neighboring frames. Recent works suggest establishing accurate correspondences between neighboring frames before fusing temporal information. However, the generated results then depend heavily on the quality of correspondence estimation. In this paper, we propose a more robust solution: \emph{sampling and fusing multi-level features} across neighboring frames to generate the results. Based on this idea, we introduce a new module that improves the capability of 3D convolution, namely learnable sampling 3D convolution (\emph{LS3D-Conv}). We add learnable 2D offsets to 3D convolution so that it can sample locations on spatial feature maps across frames. The offsets can be learned for specific tasks. \emph{LS3D-Conv} can flexibly replace the 3D convolution layers in existing 3D networks, yielding new architectures that learn the sampling at multiple feature levels. Experiments on video interpolation, video super-resolution, video denoising, and action recognition demonstrate the effectiveness of our approach.
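To make the idea of adding 2D offsets to a 3D convolution concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It computes one output value of a single-channel 3x3x3 convolution in which every kernel tap samples its frame at a spatially shifted position. Several things here are assumptions rather than details from the abstract: the names `bilinear` and `ls3d_point` are hypothetical, fractional positions are read with bilinear interpolation, there is one offset per kernel tap, and the offsets are passed in as fixed numbers instead of being learned.

```python
import math


def bilinear(frame, y, x):
    """Bilinearly sample the 2D list `frame` at fractional (y, x).

    Positions outside the frame contribute zero (assumed zero padding).
    """
    h, w = len(frame), len(frame[0])
    y0, x0 = math.floor(y), math.floor(x)
    val = 0.0
    for yi in (y0, y0 + 1):
        for xi in (x0, x0 + 1):
            if 0 <= yi < h and 0 <= xi < w:
                wgt = (1 - abs(y - yi)) * (1 - abs(x - xi))
                val += wgt * frame[yi][xi]
    return val


def ls3d_point(video, weights, offsets, t, y, x):
    """One output value of a 3x3x3 convolution with a 2D offset per tap.

    video:   list of T frames, each an H x W list of floats (one channel)
    weights: weights[dt][dy][dx], the 3x3x3 kernel
    offsets: offsets[dt][dy][dx] = (off_y, off_x); fixed here, but these
             would be learned in a trained network (assumption of this sketch)
    """
    out = 0.0
    for dt in range(3):
        tt = t + dt - 1
        if not 0 <= tt < len(video):
            continue  # zero padding along the time axis
        for dy in range(3):
            for dx in range(3):
                off_y, off_x = offsets[dt][dy][dx]
                # Regular grid position plus the 2D offset for this tap.
                sy = y + dy - 1 + off_y
                sx = x + dx - 1 + off_x
                out += weights[dt][dy][dx] * bilinear(video[tt], sy, sx)
    return out


# Toy data: pixel value encodes frame, row, and column as 100*f + 10*r + c.
video = [[[100 * f + 10 * r + c for c in range(3)] for r in range(3)]
         for f in range(3)]

# Kernel that only reads the center tap of the previous frame.
weights = [[[0.0] * 3 for _ in range(3)] for _ in range(3)]
weights[0][1][1] = 1.0

# Zero offsets reduce the operation to an ordinary 3D convolution.
offsets = [[[(0.0, 0.0) for _ in range(3)] for _ in range(3)] for _ in range(3)]
plain = ls3d_point(video, weights, offsets, 1, 1, 1)

# Shift that tap half a pixel to the right in the previous frame.
offsets[0][1][1] = (0.0, 0.5)
shifted = ls3d_point(video, weights, offsets, 1, 1, 1)
```

With zero offsets, `plain` reads the pixel at row 1, column 1 of frame 0. With the half-pixel offset, `shifted` is the average of that pixel and its right-hand neighbor. The sketch only illustrates how an offset moves a sampling location within a neighboring frame; how the offsets are produced and trained is not specified in the abstract and is left out.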