TennisExpert: Towards Expert-Level Analytical Sports Video Understanding
This work addresses the underexplored area of expert-level sports video analysis, with applications in professional match analysis, automated coaching, and real-time commentary, though it is incremental in that it builds on existing multimodal methods.
The paper tackles automatic tennis video understanding by introducing TennisVL, a large-scale benchmark with expert analytical commentary, and TennisExpert, a multimodal framework that the authors report outperforms proprietary baselines such as GPT-5, Gemini, and Claude in capturing tactical context and match dynamics.
Tennis is one of the most widely followed sports, generating extensive broadcast footage with strong potential for professional analysis, automated coaching, and real-time commentary. However, automatic tennis understanding remains underexplored due to two key challenges: (1) the lack of large-scale benchmarks with fine-grained annotations and expert-level commentary, and (2) the difficulty of building accurate yet efficient multimodal systems suitable for real-time deployment. To address these challenges, we introduce TennisVL, a large-scale tennis benchmark comprising over 200 professional matches (471.9 hours) and 40,000+ rally-level clips. Unlike existing commentary datasets that focus on descriptive play-by-play narration, TennisVL emphasizes expert analytical commentary capturing tactical reasoning, player decisions, and match momentum. Furthermore, we propose TennisExpert, a multimodal tennis understanding framework that integrates a video semantic parser with a memory-augmented model built on Qwen3-VL-8B. The parser extracts key match elements (e.g., scores, shot sequences, ball bounces, and player locations), while hierarchical memory modules capture both short- and long-term temporal context. Experiments show that TennisExpert consistently outperforms strong proprietary baselines, including GPT-5, Gemini, and Claude, and demonstrates improved ability to capture tactical context and match dynamics.
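To make the parser-plus-memory idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes the parser emits one structured record per rally and that the hierarchical memory can be approximated by a short sliding window of recent rallies plus running match-level counts. The names `RallyRecord`, `HierarchicalMemory`, and all field names are hypothetical.

```python
from collections import Counter, deque
from dataclasses import dataclass


@dataclass
class RallyRecord:
    """Hypothetical parser output for one rally (field names are assumptions)."""
    score: str        # score when the rally starts, e.g. "30-15"
    shots: list[str]  # ordered shot types, e.g. ["serve", "forehand"]
    winner: str       # player who won the point


class HierarchicalMemory:
    """Short-term window of recent rallies plus long-term aggregate counts."""

    def __init__(self, window: int = 3):
        self.recent = deque(maxlen=window)  # short-term: last `window` rallies
        self.points_won = Counter()         # long-term: points per player
        self.shot_counts = Counter()        # long-term: shot-type frequencies

    def update(self, rally: RallyRecord) -> None:
        """Add a rally to both memory levels."""
        self.recent.append(rally)
        self.points_won[rally.winner] += 1
        self.shot_counts.update(rally.shots)

    def context(self) -> str:
        """Render both memory levels as text that could accompany a clip."""
        short = "; ".join(
            f"{r.score}: {' > '.join(r.shots)} ({r.winner} wins)"
            for r in self.recent
        )
        long = ", ".join(
            f"{player} {n} pts" for player, n in sorted(self.points_won.items())
        )
        return f"Recent rallies: {short}\nMatch so far: {long}"


memory = HierarchicalMemory(window=2)
memory.update(RallyRecord("0-0", ["serve", "forehand"], "A"))
memory.update(RallyRecord("15-0", ["serve", "backhand", "volley"], "B"))
memory.update(RallyRecord("15-15", ["serve"], "A"))
print(memory.context())
```

In a pipeline of this shape, the string returned by `context()` would be supplied to the vision-language model together with the current clip. How TennisExpert actually encodes and retrieves its memory is not specified in the abstract.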