CV

Aug 19, 2025

OmViD: Omni-supervised active learning for video action detection

Aayush Rana, Akash Kumar, Vibhav Vineet, Yogesh S Rawat

arXiv:2508.13983v110.23 citations

h-index: 34

2025 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)

Originality: Incremental advance

AI Analysis

This paper addresses the annotation cost of video action detection. It is an incremental improvement that matches the annotation type to each sample instead of annotating every video densely.

The paper tackles the cost of dense annotation for video action detection by analyzing which annotation type each sample needs. It proposes an active learning strategy that estimates the necessary annotation type per video, and a 3D-superpixel method that generates pseudo-labels from those annotations. On the UCF101-24 and JHMDB-21 datasets, it reports a significant reduction in annotation cost with minimal performance loss.

Video action detection requires dense spatio-temporal annotations, which are both challenging and expensive to obtain. However, real-world videos often vary in difficulty and may not require the same level of annotation. This paper analyzes the appropriate annotation types for each sample and their impact on spatio-temporal video action detection. It focuses on two key aspects: 1) how to obtain varying levels of annotation for videos, and 2) how to learn action detection from different annotation types. The study explores video-level tags, points, scribbles, bounding boxes, and pixel-level masks. First, a simple active learning strategy is proposed to estimate the necessary annotation type for each video. Then, a novel spatio-temporal 3D-superpixel approach is introduced to generate pseudo-labels from these annotations, enabling effective training. The approach is validated on UCF101-24 and JHMDB-21 datasets, significantly cutting annotation costs with minimal performance loss.
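The idea of choosing a cheaper or richer annotation type per video can be illustrated with a minimal sketch. This is not the paper's algorithm: the uncertainty score, the equal-width binning rule, and the relative costs below are all hypothetical placeholders chosen only to show how harder videos might be routed to denser annotation while easier ones receive cheap labels.

```python
# Illustrative sketch only. The costs, the binning rule, and the notion of a
# per-video "uncertainty" score in [0, 1] are assumptions, not the paper's method.

# Annotation types named in the abstract, ordered from cheapest to densest.
ANNOTATION_ORDER = ["tag", "point", "scribble", "box", "mask"]

# Hypothetical relative annotation costs (arbitrary units).
ANNOTATION_COSTS = {
    "tag": 1.0,
    "point": 2.0,
    "scribble": 4.0,
    "box": 8.0,
    "mask": 20.0,
}


def choose_annotation_type(uncertainty):
    """Map an uncertainty score in [0, 1] to an annotation type.

    The score range is split into equal-width bins, one per annotation type,
    so more uncertain videos are sent to denser annotation.
    """
    if not 0.0 <= uncertainty <= 1.0:
        raise ValueError("uncertainty must lie in [0, 1]")
    n_types = len(ANNOTATION_ORDER)
    # A score of exactly 1.0 would index past the end, so clamp to the last bin.
    index = min(int(uncertainty * n_types), n_types - 1)
    return ANNOTATION_ORDER[index]


def plan_annotations(uncertainties):
    """Return (plan, total_cost) for a dict of video id -> uncertainty."""
    plan = {vid: choose_annotation_type(u) for vid, u in uncertainties.items()}
    total_cost = sum(ANNOTATION_COSTS[kind] for kind in plan.values())
    return plan, total_cost


if __name__ == "__main__":
    # Made-up scores for three hypothetical videos.
    scores = {"video_a": 0.05, "video_b": 0.5, "video_c": 0.95}
    plan, cost = plan_annotations(scores)
    dense_cost = len(scores) * ANNOTATION_COSTS["mask"]
    print(plan)
    print(f"mixed cost: {cost}, all-mask cost: {dense_cost}")
```

Under these made-up costs, a mixed plan is cheaper than annotating every video with masks. How much accuracy is retained depends on how well the model learns from the weaker annotation types, which is what the pseudo-label step in the abstract addresses.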
