Video BagNet: short temporal receptive fields increase robustness in long-term action recognition
This work addresses robustness in video action recognition, which matters for applications such as surveillance and sports analysis. The contribution is incremental: it modifies an existing model (3D ResNet-50) rather than introducing a new paradigm.
The paper tackles the sensitivity of long-term video action recognition models to changes in sub-action order, which it attributes to their large temporal receptive fields. It finds that limiting the temporal receptive field to short durations (1, 9, 17, or 33 frames) increases robustness. Experiments on synthetic and real-world datasets show performance improvements.
Previous work on long-term video action recognition relies on deep 3D-convolutional models that have a large temporal receptive field (RF). We argue that these models are not always the best choice for temporal modeling in videos. A large temporal receptive field allows the model to encode the exact sub-action order of a video, which causes a performance decrease when test videos have a different sub-action order. In this work, we investigate whether we can improve model robustness to the sub-action order by shrinking the temporal receptive field of action recognition models. For this, we design Video BagNet, a variant of the 3D ResNet-50 model with the temporal receptive field size limited to 1, 9, 17, or 33 frames. We analyze Video BagNet on synthetic and real-world video datasets and experimentally compare models with varying temporal receptive fields. We find that models with short temporal receptive fields are robust to sub-action order changes, while models with larger temporal receptive fields are sensitive to the sub-action order.
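To make the notion of a limited temporal receptive field concrete, the following minimal sketch computes the temporal RF of a stack of convolutions using the standard receptive-field recurrence. The function name `temporal_receptive_field` and the layer lists are hypothetical illustrations, not the Video BagNet configuration. The assumption here is only that each layer is described by a temporal kernel size and a temporal stride. Under that assumption, stacks of 0, 4, 8, and 16 stride-1 layers with temporal kernel size 3 happen to yield RFs of 1, 9, 17, and 33 frames.

```python
from typing import List, Tuple


def temporal_receptive_field(layers: List[Tuple[int, int]]) -> int:
    """Temporal receptive field, in input frames, of a stack of conv layers.

    `layers` lists (temporal_kernel_size, temporal_stride) pairs from input
    to output. The layer lists used below are illustrative assumptions,
    not the architecture described in the paper.
    """
    rf = 1    # a single output unit initially sees one frame
    jump = 1  # distance in input frames between adjacent units at this depth
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the RF by (k - 1) * jump
        jump *= stride             # striding spreads later layers further apart
    return rf


# Hypothetical stacks: n stride-1 layers with temporal kernel size 3.
# Layers with temporal kernel size 1 would not change the RF.
for n in (0, 4, 8, 16):
    print(n, temporal_receptive_field([(3, 1)] * n))
# prints:
# 0 1
# 4 9
# 8 17
# 16 33
```

The recurrence shows why replacing temporal kernels of size 3 with size 1 shrinks the RF: a layer with kernel size 1 contributes nothing to the sum, so the RF is controlled by how many layers keep a temporal extent and by the strides that precede them.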