CV
May 6, 2025

Deep Learning for Sports Video Event Detection: Tasks, Datasets, Methods, and Challenges

arXiv:2505.03991v3
4 citations
h-index: 18
Originality: Synthesis-oriented
AI Analysis

The paper addresses unclear task definitions and limited practical applicability in sports video event detection, a problem for both researchers and industry practitioners. As a survey, its contribution is incremental.

The survey separates three tasks that are often conflated: Temporal Action Localization, Action Spotting, and Precise Event Spotting. It also provides a structured taxonomy of methods and a critical assessment of datasets, with the aim of supporting temporally precise, deployable systems.

Video event detection has become a cornerstone of modern sports analytics, powering automated performance evaluation, content generation, and tactical decision-making. Recent advances in deep learning have driven progress in related tasks such as Temporal Action Localization (TAL), which detects extended action segments; Action Spotting (AS), which identifies a representative timestamp; and Precise Event Spotting (PES), which pinpoints the exact frame of an event. Although these tasks are closely connected, their subtle differences often blur the boundaries between them, leading to confusion in both research and practical applications. Furthermore, prior surveys address either generic video event detection or broader sports video tasks, and largely overlook the unique temporal granularity and domain-specific challenges of event spotting. In addition, most existing sports video surveys focus on elite-level competitions while neglecting the wider community of everyday practitioners. This survey addresses these gaps by: (i) clearly delineating TAL, AS, and PES and their respective use cases; (ii) introducing a structured taxonomy of state-of-the-art approaches, including temporal modeling strategies, multimodal frameworks, and data-efficient pipelines tailored for AS and PES; and (iii) critically assessing benchmark datasets and evaluation protocols, highlighting limitations such as reliance on broadcast-quality footage and metrics that over-reward permissive multilabel predictions. By synthesizing current research and exposing open challenges, this work provides a comprehensive foundation for developing temporally precise, generalizable, and practically deployable sports event detection systems for both the research and industry communities.
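The difference in output granularity between TAL, AS, and PES can be illustrated with a minimal Python sketch. It is not taken from the survey: the class and function names are hypothetical, the midpoint rule for picking a representative timestamp is an assumption made only for illustration, and real AS or PES systems predict their outputs directly rather than deriving them from segments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionSegment:
    """TAL-style output: an extended action with a start and an end time."""
    label: str
    start_s: float
    end_s: float


@dataclass(frozen=True)
class ActionSpot:
    """AS-style output: one representative timestamp for the action."""
    label: str
    time_s: float


@dataclass(frozen=True)
class PreciseEvent:
    """PES-style output: the exact frame index of the event."""
    label: str
    frame: int


def segment_to_spot(segment: ActionSegment) -> ActionSpot:
    # Assumption for illustration only: use the segment midpoint
    # as the representative timestamp.
    midpoint = (segment.start_s + segment.end_s) / 2
    return ActionSpot(segment.label, midpoint)


def spot_to_frame(spot: ActionSpot, fps: float) -> PreciseEvent:
    # Map a timestamp to the nearest frame index at the given frame rate.
    return PreciseEvent(spot.label, round(spot.time_s * fps))


if __name__ == "__main__":
    segment = ActionSegment("shot", start_s=12.0, end_s=14.0)
    spot = segment_to_spot(segment)
    event = spot_to_frame(spot, fps=25.0)
    print(segment)
    print(spot)
    print(event)
```

The three record types carry progressively less temporal extent and demand progressively more precision: an interval, a single time, and a single frame.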

