CV, Oct 20, 2021

GTM: Gray Temporal Model for Video Recognition

arXiv:2110.10348v1
Originality: Incremental advance
AI Analysis

This work addresses efficiency and modeling challenges in video recognition and is relevant to both researchers and practitioners. It is an incremental advance, since it builds on existing input modalities.

The paper tackles video action recognition by introducing a new input modality called gray stream, which uses stacked consecutive gray images to skip RGB conversion and enhance spatio-temporal modeling without extra computation or parameters, achieving impressive results on benchmarks like Kinetics and UCF-101.

Data input modality plays an important role in video action recognition. Normally, there are three types of input: RGB, flow stream and compressed data. In this paper, we propose a new input modality: gray stream. Specifically, taking three stacked consecutive gray images as input, which has the same size as an RGB frame, not only skips the conversion from decoded video data to RGB but also improves spatio-temporal modeling ability at zero additional computation and zero additional parameters. We also propose a 1D Identity Channel-wise Spatio-temporal Convolution (1D-ICSC), which captures temporal relationships at the channel-feature level within a controllable computation budget (set by the parameters G and R). Finally, we confirm its effectiveness and efficiency on several action recognition benchmarks, such as Kinetics, Something-Something, HMDB-51 and UCF-101, and achieve impressive results.
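The gray-stream input described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes the gray frames are already available as an array of shape (T, H, W), for example the luma plane of decoded video. The function name `stack_gray_frames` and the `start` parameter are hypothetical and chosen only for illustration.

```python
import numpy as np


def stack_gray_frames(gray_frames: np.ndarray, start: int) -> np.ndarray:
    """Stack 3 consecutive gray frames into one (H, W, 3) array.

    Assumption: gray_frames has shape (T, H, W), one gray image per time step.
    The result has the same shape as a single RGB frame, but its three
    channels hold three consecutive time steps instead of three colors.
    """
    if gray_frames.ndim != 3:
        raise ValueError("expected an array of shape (T, H, W)")
    if start < 0 or start + 3 > gray_frames.shape[0]:
        raise ValueError("need 3 consecutive frames starting at `start`")
    # Move the time axis to the channel position: (3, H, W) -> (H, W, 3).
    return np.moveaxis(gray_frames[start:start + 3], 0, -1)


# Synthetic 8-frame gray video of size 4x6, used only for demonstration.
video = np.random.default_rng(0).integers(0, 256, size=(8, 4, 6), dtype=np.uint8)
sample = stack_gray_frames(video, start=2)
print(sample.shape)
```

Because the stacked array has the same shape as an RGB frame, it could in principle be fed to a network built for RGB input without changing the first layer, which is consistent with the abstract's claim of zero extra computation and parameters.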

