CV | Apr 15, 2024

Unifying Global and Local Scene Entities Modelling for Precise Action Spotting

Kim Hoang Tran, Phuc Vuong Do, Ngoc Quoc Ly, Ngan Le

arXiv:2404.09951v1 | 10.511 citations | h-index: 2 | Has Code | IJCNN

Originality: Highly original

AI Analysis

This paper addresses precise action spotting in sports videos. For computer vision researchers, it offers an incremental improvement over existing methods by handling small objects and scene nuances better.

The paper tackles action detection in sports videos, which is challenging because of cluttered backgrounds and small objects. It introduces an approach that models both global environment features and local scene entities using an adaptive attention mechanism. The method achieved state-of-the-art results, taking 1st place on three benchmarks (SoccerNet-v2 Action Spotting, FineDiving, and FineGym) with avg-mAP improvements of 1.6, 2.0, and 1.3 points, respectively, over the runner-up methods.

Sports videos pose complex challenges, including cluttered backgrounds, camera angle changes, small action-representing objects, and imbalanced action class distribution. Existing methods for detecting actions in sports videos rely heavily on global features, using a backbone network as a black box that covers the entire spatial frame. However, these approaches tend to overlook the nuances of the scene and struggle to detect actions that occupy a small portion of the frame. In particular, they have difficulty with action classes involving small objects, such as balls or yellow/red cards in soccer, which occupy only a fraction of the screen. To address these challenges, we introduce a novel approach that analyzes and models scene entities using an adaptive attention mechanism. Specifically, our model disentangles the scene content into a global environment feature and local features of relevant scene entities. To extract environment features efficiently while accounting for temporal information at lower computational cost, we propose using a 2D backbone network with a time-shift mechanism. To capture relevant scene entities accurately, we employ a Vision-Language model in conjunction with the adaptive attention mechanism. Our model has demonstrated outstanding performance, securing 1st place in the SoccerNet-v2 Action Spotting, FineDiving, and FineGym challenges with substantial improvements of 1.6, 2.0, and 1.3 points in avg-mAP, respectively, over the runner-up methods. Furthermore, our approach offers interpretability, in contrast to other deep learning models, which are often designed as black boxes. Our code and models are released at: https://github.com/Fsoft-AIC/unifying-global-local-feature.
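
The abstract mentions a 2D backbone with a time-shift mechanism but does not say how the shift is done. As a rough illustration only, the sketch below assumes a TSM-style channel shift, in which a fraction of each frame's feature channels is swapped with the neighbouring frames so that a per-frame 2D network sees some temporal context. The function name `temporal_shift` and the parameter `fold_div` are hypothetical and are not taken from the released code.

```python
from typing import List


def temporal_shift(frames: List[List[float]], fold_div: int = 4) -> List[List[float]]:
    """Shift a fraction of channels along the time axis (assumed TSM-style).

    frames: T per-frame feature vectors, each of length C.
    The first C // fold_div channels are taken from the next frame,
    the following C // fold_div channels from the previous frame,
    and the remaining channels are left in place. Positions with no
    source frame (sequence boundaries) are filled with zeros.
    """
    num_frames = len(frames)
    if num_frames == 0:
        return []
    num_channels = len(frames[0])
    fold = num_channels // fold_div

    shifted = [[0.0] * num_channels for _ in range(num_frames)]
    for t in range(num_frames):
        for c in range(num_channels):
            if c < fold:
                src = t + 1      # pull from the future frame
            elif c < 2 * fold:
                src = t - 1      # pull from the past frame
            else:
                src = t          # channel stays where it is
            if 0 <= src < num_frames:
                shifted[t][c] = frames[src][c]
    return shifted


if __name__ == "__main__":
    # Three frames with four channels each (toy values).
    toy = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
    for row in temporal_shift(toy):
        print(row)
```

The shift itself has no learnable parameters and adds no multiply-accumulate operations, which is consistent with the abstract's stated goal of capturing temporal information at lower computational cost than a 3D backbone; whether the authors use exactly this variant is an assumption.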
