Revisiting Kernel Temporal Segmentation as an Adaptive Tokenizer for Long-form Video Understanding
This paper addresses the challenge of understanding long videos efficiently, though its contribution is an incremental improvement over existing sampling techniques.
The paper tackles long-form video processing by proposing an adaptive sampling method based on Kernel Temporal Segmentation (KTS), which replaces uniform sampling so that sampled clips follow semantically consistent segments. The authors report consistent gains over existing approaches and state-of-the-art performance on long-form video classification and temporal action localization.
While most modern video understanding models operate on short-range clips, real-world videos are often several minutes long and consist of semantically consistent segments of variable length. A common approach to processing long videos is to apply a short-form video model to uniformly sampled clips of fixed temporal length and aggregate the outputs. This approach neglects the underlying nature of long videos, since fixed-length clips are often redundant or uninformative. In this paper, we aim to provide a generic and adaptive sampling approach for long-form videos in lieu of the de facto uniform sampling. Viewing videos as sequences of semantically consistent segments, we formulate a task-agnostic, unsupervised, and scalable approach based on Kernel Temporal Segmentation (KTS) for sampling and tokenizing long videos. We evaluate our method on long-form video understanding tasks such as video classification and temporal action localization, showing consistent gains over existing approaches and achieving state-of-the-art performance on long-form video modeling.
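To make the segmentation idea concrete, here is a minimal sketch of kernel-based temporal segmentation followed by per-segment sampling. It is not the authors' implementation. It assumes a linear kernel over per-frame features, a fixed number of change points chosen by the caller (the abstract does not say how the number is selected), and midpoint sampling within each segment. The function names `segment_costs`, `kts_change_points`, and `segment_midpoints` are hypothetical.

```python
import numpy as np


def segment_costs(K):
    """Within-segment scatter of every segment [i, j) under Gram matrix K."""
    n = K.shape[0]
    # 2D prefix sums make the sum of any square block of K an O(1) lookup.
    P = np.zeros((n + 1, n + 1))
    P[1:, 1:] = K.cumsum(axis=0).cumsum(axis=1)
    d = np.concatenate(([0.0], np.cumsum(np.diag(K))))
    cost = np.full((n + 1, n + 1), np.inf)
    for i in range(n):
        for j in range(i + 1, n + 1):
            block = P[j, j] - P[i, j] - P[j, i] + P[i, i]
            cost[i, j] = (d[j] - d[i]) - block / (j - i)
    return cost


def kts_change_points(features, num_change_points):
    """Return sorted change-point indices minimizing total within-segment scatter.

    Assumption: linear kernel and a caller-supplied number of change points.
    """
    K = features @ features.T
    n = K.shape[0]
    m = num_change_points
    cost = segment_costs(K)
    # best[k, j]: minimal cost of splitting frames [0, j) into k + 1 segments.
    best = np.full((m + 1, n + 1), np.inf)
    back = np.zeros((m + 1, n + 1), dtype=int)
    best[0, 1:] = cost[0, 1:]
    for k in range(1, m + 1):
        for j in range(k + 1, n + 1):
            # The last segment is [t, j); the first k segments cover [0, t).
            candidates = best[k - 1, k:j] + cost[k:j, j]
            t = int(np.argmin(candidates))
            best[k, j] = candidates[t]
            back[k, j] = k + t
    # Walk the back-pointers from the end of the video to recover boundaries.
    change_points = []
    j = n
    for k in range(m, 0, -1):
        j = int(back[k, j])
        change_points.append(j)
    return change_points[::-1]


def segment_midpoints(change_points, num_frames):
    """Pick one frame index per segment (its midpoint) instead of a uniform grid."""
    bounds = [0] + list(change_points) + [num_frames]
    return [(a + b) // 2 for a, b in zip(bounds[:-1], bounds[1:])]
```

For example, calling `kts_change_points(features, 4)` on an array of shape `(num_frames, feature_dim)` yields four boundaries, and `segment_midpoints` then returns five frame indices, one per segment. The double loop in `segment_costs` is quadratic in the number of frames, so a practical version would likely cap the maximum segment length or work on subsampled features.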