CV · AI · LG · Jun 20, 2025

Do We Need Large VLMs for Spotting Soccer Actions?

arXiv:2506.17144v2 · 1 citation · h-index: 3
Originality: Incremental advance
AI Analysis

This work addresses the need for lightweight and scalable action spotting in soccer, offering a novel approach that reduces computational costs, though it is incremental in its domain-specific application.

The paper tackles the problem of soccer action spotting by shifting from video-centric to text-based methods using Large Language Models (LLMs) instead of Vision-Language Models (VLMs), achieving performance close to state-of-the-art video-based spotters with zero video processing compute and similar time requirements.

Traditional video-based tasks like soccer action spotting rely heavily on visual inputs, often requiring complex and computationally expensive models to process dense video data. We propose a shift from this video-centric approach to a text-based task, making it lightweight and scalable by utilizing Large Language Models (LLMs) instead of Vision-Language Models (VLMs). We posit that expert commentary, which provides rich descriptions and contextual cues, contains sufficient information to reliably spot key actions in a match. To demonstrate this, we employ a system of three LLMs acting as judges specializing in outcome, excitement, and tactics for spotting actions in soccer matches. Our experiments show that this language-centric approach detects critical match events effectively, coming close to state-of-the-art video-based spotters while using zero video-processing compute and a similar amount of time to process the entire match.
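
The three-judge idea can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the keyword-based judges merely stand in for LLM calls, and the names (`CommentaryLine`, `keyword_judge`, `spot_actions`), the keyword lists, and the majority-vote rule are hypothetical. The abstract does not say how the judges' verdicts are combined.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class CommentaryLine:
    timestamp_s: int  # seconds since kick-off
    text: str         # one line of expert commentary


# A judge maps a commentary line to a score in [0, 1].
Judge = Callable[[str], float]


def keyword_judge(keywords: List[str]) -> Judge:
    """Hypothetical stand-in for an LLM judge: 1.0 on any keyword hit, else 0.0."""
    lowered = [k.lower() for k in keywords]

    def judge(text: str) -> float:
        t = text.lower()
        return 1.0 if any(k in t for k in lowered) else 0.0

    return judge


# One judge per specialty named in the abstract; keyword lists are made up.
JUDGES: Dict[str, Judge] = {
    "outcome": keyword_judge(["goal", "penalty", "red card", "yellow card"]),
    "excitement": keyword_judge(["!", "what a", "incredible"]),
    "tactics": keyword_judge(["substitution", "corner", "free kick", "penalty"]),
}


def spot_actions(lines: List[CommentaryLine],
                 min_votes: int = 2,
                 threshold: float = 0.5) -> List[int]:
    """Return timestamps where at least `min_votes` judges score >= threshold.

    Majority voting is an assumed aggregation rule, chosen only for the sketch.
    """
    spotted = []
    for line in lines:
        votes = sum(1 for judge in JUDGES.values() if judge(line.text) >= threshold)
        if votes >= min_votes:
            spotted.append(line.timestamp_s)
    return spotted


# Invented commentary, for demonstration only.
SAMPLE = [
    CommentaryLine(60, "Patient build-up in midfield."),
    CommentaryLine(754, "GOAL! What a finish from the striker!"),
    CommentaryLine(1800, "A substitution is being prepared on the touchline."),
    CommentaryLine(2400, "Penalty! The referee points to the spot."),
]

print(spot_actions(SAMPLE))  # [754, 2400]
```

Replacing each `keyword_judge` with a real LLM call would keep the same shape: the input is text only, so no video frames are decoded at any point, which is the source of the compute saving the abstract claims.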
