CV, LG | Aug 25, 2025

Large VLM-based Stylized Sports Captioning

arXiv:2508.19295v1 | h-index: 9
Originality: Incremental advance
AI Analysis

This work addresses the need for production-grade, human-like sports captioning for applications like live journalism, though it is incremental as it builds on existing LVLM methods.

The paper tackles the problem of generating accurate, stylized sports captions from images, a task at which existing LLMs/LVLMs fall short because they lack sufficient sports jargon. It proposes a two-level fine-tuned LVLM pipeline that improves F1 by > 8-10% and BERT score by > 2-10% over alternative approaches, and it was applied to live sports journalism during Super Bowl LIX.

The advent of large (visual) language models (LLMs / LVLMs) has led to a deluge of automated human-like systems in several domains, including social media content generation, search and recommendation, healthcare prognosis, and AI assistants for cognitive tasks. Although these systems have been successfully integrated into production, very little focus has been placed on sports, particularly on accurate identification and natural language description of game play. Most existing LLMs/LVLMs can explain generic sports activities but lack sufficient domain-centric sports jargon to create natural (human-like) descriptions. This work highlights the limitations of existing SoTA LLMs/LVLMs in generating production-grade sports captions from images in a desired stylized format, and proposes a two-level fine-tuned LVLM pipeline to address them. The proposed pipeline yields an improvement of > 8-10% in F1 and > 2-10% in BERT score compared to alternative approaches. In addition, it has a small runtime memory footprint and fast execution time. During Super Bowl LIX, the pipeline proved its practical value for live professional sports journalism, generating highly accurate and stylized captions at a rate of 6 images per 3-5 seconds for over 1000 images during game play.

