CV · Mar 17, 2025

Action tube generation by person query matching for spatio-temporal action detection

arXiv:2503.12969v1 · h-index: 2 · VISIGRAPP: VISAPP
Originality: Incremental advance
AI Analysis

The paper addresses efficient and accurate action detection in videos, for applications such as surveillance and video analysis. The advance is incremental, since it builds on existing query-based detection methods.

The paper tackles spatio-temporal action detection by generating action tubes directly from videos, without post-processing. It applies query-based detection to each frame and uses a Query Matching Module to link the same person across frames. It reports strong performance on the JHMDB, UCF101-24, and AVA benchmarks, with improved computational efficiency and lower resource requirements.

This paper proposes a method for spatio-temporal action detection (STAD) that directly generates action tubes from the original video without relying on post-processing steps such as IoU-based linking and clip splitting. Our approach applies query-based detection (DETR) to each frame and matches DETR queries to link the same person across frames. We introduce the Query Matching Module (QMM), which uses metric learning to bring queries for the same person closer together across frames than queries for different people. Action classes are predicted from the sequence of queries obtained by QMM matching, which allows variable-length inputs from videos longer than a single clip. Experimental results on the JHMDB, UCF101-24, and AVA datasets demonstrate that our method performs well under large position changes of people while offering superior computational efficiency and lower resource requirements.
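
The query-linking idea in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' QMM: it assumes per-frame query embeddings are already available as NumPy arrays with the same number of queries in every frame, and it substitutes simple greedy cosine-similarity matching for whatever matching rule the paper actually uses. The function names `match_queries` and `link_tubes` are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def match_queries(prev, curr):
    """Greedy one-to-one matching between two sets of query embeddings.

    prev: (N, D) array, curr: (M, D) array.
    Returns a sorted list of (prev_index, curr_index) pairs.
    Assumption: greedy matching stands in for the paper's matching rule.
    """
    sim = l2_normalize(prev) @ l2_normalize(curr).T
    pairs = []
    for _ in range(min(sim.shape)):
        # Take the most similar remaining pair, then retire its row and column.
        i, j = np.unravel_index(np.argmax(sim), sim.shape)
        pairs.append((int(i), int(j)))
        sim[i, :] = -np.inf
        sim[:, j] = -np.inf
    return sorted(pairs)


def link_tubes(frames):
    """Link queries across frames into per-person index sequences.

    frames: list of (N, D) arrays, one per frame, same N in every frame.
    Returns a list of N tubes; tube[t] is the query index used in frame t.
    """
    tubes = [[i] for i in range(frames[0].shape[0])]
    for t in range(1, len(frames)):
        # Embeddings of each tube's most recent query, in tube order.
        last = [tube[-1] for tube in tubes]
        assignment = dict(match_queries(frames[t - 1][last], frames[t]))
        for k, tube in enumerate(tubes):
            tube.append(assignment[k])
    return tubes


# Toy example: two people whose query slots swap in the middle frame.
frames = [
    np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]),
    np.array([[0.1, 0.9, 0.0], [0.9, 0.1, 0.0]]),
    np.array([[0.95, 0.0, 0.05], [0.0, 1.0, 0.1]]),
]
tubes = link_tubes(frames)
```

In this toy input each tube follows one person even though the query slot holding that person changes between frames. A globally optimal assignment (for example the Hungarian algorithm) could replace the greedy loop; that choice, like the cosine similarity itself, is an assumption of this sketch rather than something stated in the abstract.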

