CV · Apr 10

Efficient Spatial-Temporal Focal Adapter with SSM for Temporal Action Detection

arXiv:2604.09164 · 38.8 · h-index: 37
Predicted impact top 79% in CV · last 90 days · Originality: Incremental advance
AI Analysis

This work addresses scalability issues in real-world video analysis for applications like video understanding, representing an incremental improvement over prior SSM-based and other structural methods.

The paper tackles the problem of temporal human action detection in untrimmed videos by addressing feature redundancy and global dependency modeling limitations in existing methods, resulting in a novel framework that significantly enhances localization performance and robustness as demonstrated through extensive experiments.

Temporal human action detection aims to identify and localize action segments within untrimmed videos, serving as a pivotal task in video understanding. Despite the progress achieved by prior architectures like CNN and Transformer models, these models continue to struggle with feature redundancy and degraded global dependency modeling capabilities when applied to long video sequences. These limitations severely constrain their scalability in real-world video analysis. State Space Models (SSMs) offer a promising alternative with linear-complexity long-term modeling and robust global temporal reasoning capabilities. Rethinking the application of SSMs in temporal modeling, this research constructs a novel framework for video human action detection. Specifically, we introduce the Efficient Spatial-Temporal Focal (ESTF) Adapter into the pre-trained layers. This module integrates the advantages of our proposed Temporal Boundary-aware SSM (TB-SSM) for temporal feature modeling with efficient processing of spatial features. We perform comprehensive and quantitative analyses across multiple benchmarks, comparing our proposed method against previous SSM-based and other structural methods. Extensive experiments demonstrate that our improved strategy significantly enhances both localization performance and robustness, validating the effectiveness of our proposed method.
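To make the "linear long-term modeling" claim concrete, the following is a minimal sketch of a generic scalar discrete state space recurrence, not the paper's TB-SSM or ESTF Adapter. The function name `ssm_scan` and the parameters `a`, `b`, `c` are hypothetical and chosen only for illustration. The point is that one pass over the sequence updates a fixed-size state, so cost grows linearly with the number of frames.

```python
from typing import List


def ssm_scan(x: List[float], a: float, b: float, c: float) -> List[float]:
    """Run a scalar discrete state space model over a 1-D sequence.

    Illustrative recurrence (an assumption, not the paper's formulation):
        h[t] = a * h[t-1] + b * x[t]
        y[t] = c * h[t]
    """
    h = 0.0  # hidden state carried across time steps
    y = []
    for x_t in x:  # single pass: O(T) time, O(1) state
        h = a * h + b * x_t
        y.append(c * h)
    return y


if __name__ == "__main__":
    # An impulse at t=0 decays geometrically through the state.
    impulse = [1.0, 0.0, 0.0, 0.0]
    print(ssm_scan(impulse, a=0.5, b=1.0, c=2.0))  # [2.0, 1.0, 0.5, 0.25]
```

In this sketch the state `h` summarizes all earlier inputs, which is how an SSM can carry information across long sequences without the pairwise comparisons of self-attention.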

