CVOct 16, 2025

Shot2Tactic-Caption: Multi-Scale Captioning of Badminton Videos for Tactical Understanding

arXiv:2510.14617v12 citationsh-index: 4MMSports@MM
Originality Incremental advance
AI Analysis

This addresses the problem of automated tactical analysis in badminton for sports analytics, but it is incremental as it builds on existing captioning methods for a specific domain.

The paper tackles tactical understanding in badminton by proposing Shot2Tactic-Caption, a framework for multi-scale video captioning that generates shot-level and tactic-level descriptions, and introduces a dataset with 5,494 shot captions and 544 tactic captions.

Tactical understanding in badminton involves interpreting not only individual actions but also how tactics are dynamically executed over time. In this paper, we propose \textbf{Shot2Tactic-Caption}, a novel framework for semantic and temporal multi-scale video captioning in badminton, capable of generating shot-level captions that describe individual actions and tactic-level captions that capture how these actions unfold over time within a tactical execution. We also introduce the Shot2Tactic-Caption Dataset, the first badminton captioning dataset containing 5,494 shot captions and 544 tactic captions. Shot2Tactic-Caption adopts a dual-branch design, with both branches including a visual encoder, a spatio-temporal Transformer encoder, and a Transformer-based decoder to generate shot and tactic captions. To support tactic captioning, we additionally introduce a Tactic Unit Detector that identifies valid tactic units, tactic types, and tactic states (e.g., Interrupt, Resume). For tactic captioning, we further incorporate a shot-wise prompt-guided mechanism, where the predicted tactic type and state are embedded as prompts and injected into the decoder via cross-attention. The shot-wise prompt-guided mechanism enables our system not only to describe successfully executed tactics but also to capture tactical executions that are temporarily interrupted and later resumed. Experimental results demonstrate the effectiveness of our framework in generating both shot and tactic captions. Ablation studies show that the ResNet50-based spatio-temporal encoder outperforms other variants, and that shot-wise prompt structuring leads to more coherent and accurate tactic captioning.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes