CV | Sep 19, 2025

TennisTV: Do Multimodal Large Language Models Understand Tennis Rallies?

arXiv:2509.15602v2 | 3 citations | h-index: 1

Originality: Synthesis-oriented

AI Analysis

This paper addresses the challenge of evaluating MLLMs in sports analytics, but its contribution is incremental: it focuses on benchmarking rather than on novel model improvements.

The authors address the difficulty that multimodal large language models (MLLMs) have with fast, high-frequency sports like tennis by introducing TennisTV, which they present as the first comprehensive benchmark for tennis video understanding. Evaluating 17 MLLMs on it reveals substantial shortcomings and yields two key insights, one on frame-sampling density and one on temporal grounding.

Multimodal large language models (MLLMs) excel at general video understanding but struggle with fast, high-frequency sports like tennis, where rally clips are short yet information-dense. To systematically evaluate MLLMs in this challenging domain, we present TennisTV, the first and most comprehensive benchmark for tennis video understanding. TennisTV models each rally as a temporal-ordered sequence of consecutive stroke events, using automated pipelines for filtering and question generation. It covers 9 tasks from the stroke level to the rally level and includes 2943 human-verified questions. Evaluating 17 representative MLLMs, we provide the first systematic assessment of tennis video understanding. Results reveal substantial shortcomings and yield two key insights: (i) frame-sampling density should be tailored and balanced across tasks, and (ii) improving temporal grounding is essential for stronger reasoning.
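As a rough illustration of two ideas in the abstract, the sketch below represents a rally as a time-ordered list of stroke events and shows how the number of uniformly sampled frames affects how many strokes have a frame close to the moment of contact. It is not the TennisTV pipeline: the class `StrokeEvent`, its fields, the helper functions, and the 0.1 s tolerance are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class StrokeEvent:
    """One stroke in a rally (all field names are hypothetical)."""
    time_s: float      # hit time, in seconds from the start of the clip
    player: str        # e.g. "near" or "far"
    stroke_type: str   # e.g. "serve", "forehand", "backhand"


def order_rally(events: List[StrokeEvent]) -> List[StrokeEvent]:
    """Return the strokes sorted by hit time."""
    return sorted(events, key=lambda e: e.time_s)


def sample_frame_indices(num_frames: int, num_samples: int) -> List[int]:
    """Pick evenly spaced frame indices (centre of each equal-width bin)."""
    if num_frames <= 0 or num_samples <= 0:
        return []
    num_samples = min(num_samples, num_frames)
    step = num_frames / num_samples
    return [int(step * (i + 0.5)) for i in range(num_samples)]


def strokes_near_a_frame(rally: List[StrokeEvent], indices: List[int],
                         fps: float, tol_s: float = 0.1) -> int:
    """Count strokes with at least one sampled frame within tol_s seconds."""
    frame_times = [i / fps for i in indices]
    return sum(
        any(abs(t - e.time_s) <= tol_s for t in frame_times)
        for e in rally
    )


if __name__ == "__main__":
    # A made-up 4-second clip at 25 fps with three strokes.
    rally = order_rally([
        StrokeEvent(2.5, "near", "forehand"),
        StrokeEvent(0.5, "near", "serve"),
        StrokeEvent(1.5, "far", "backhand"),
    ])
    for n in (2, 4):
        idx = sample_frame_indices(num_frames=100, num_samples=n)
        print(n, idx, strokes_near_a_frame(rally, idx, fps=25.0))
```

In this toy clip, sparse sampling can miss every contact moment while a slightly denser sampling catches all of them, which is one way to read the abstract's point that sampling density matters for stroke-level questions. Whether denser sampling also helps rally-level questions is something the example does not address.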

