CV · AI · Feb 11, 2022

Deep soccer captioning with transformer: dataset, semantics-related losses, and multi-level evaluation

arXiv:2202.05728v2 · 10 citations
Originality: Synthesis-oriented
AI Analysis

It addresses captioning for soccer videos, a domain-specific task, with incremental improvements in model components and evaluation.

This work tackles caption generation for soccer videos. It introduces a dataset of 22k caption-clip pairs with three visual features and a model that combines transformers and ConvNets with semantics-related losses, yielding a 28% improvement in normalized captioning score and raising caption diversity from 0.07 to 0.18.

This work aims to generate captions for soccer videos using deep learning. To that end, the paper introduces a dataset, a model, and a three-level evaluation. The dataset consists of 22k caption-clip pairs and three visual features (images, optical flow, inpainting) for ~500 hours of SoccerNet videos. The model has three parts: a transformer learns language, ConvNets learn vision, and a fusion of linguistic and visual features generates captions. The paper proposes evaluating generated captions at three levels: syntax (commonly used metrics such as BLEU and CIDEr), meaning (the quality of descriptions for a domain expert), and corpus (the diversity of generated captions). The paper shows that semantics-related losses, which prioritize selected words, improve the diversity of generated captions from 0.07 to 0.18. Together, semantics-related losses and additional visual features (optical flow, inpainting) improved the normalized captioning score by 28%. Project web page: https://sites.google.com/view/soccercaptioning
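The abstract does not spell out the form of the semantics-related losses. As a rough illustration only, one common way to make a captioning loss "prioritize selected words" is a per-word weighted cross-entropy. The sketch below assumes that form; the function name, the weight values, and the example vocabulary are all hypothetical and are not taken from the paper.

```python
import math


def weighted_caption_nll(step_probs, target_words, word_weights, default_weight=1.0):
    """Weighted mean negative log-likelihood of one caption.

    step_probs:   list of dicts, one per time step, mapping word -> predicted probability
    target_words: list of ground-truth words, one per time step
    word_weights: dict mapping prioritized words to a weight (hypothetical values);
                  words not listed get default_weight
    """
    total = 0.0
    weight_sum = 0.0
    for probs, word in zip(step_probs, target_words):
        weight = word_weights.get(word, default_weight)
        # Clamp the probability to avoid log(0) for words the model gave no mass.
        p = max(probs.get(word, 0.0), 1e-12)
        total += -weight * math.log(p)
        weight_sum += weight
    # Normalize by the total weight so the loss stays on the same scale as plain NLL.
    return total / weight_sum if weight_sum else 0.0


# Toy example: the model is confident about "the" but not about "goal".
step_probs = [{"the": 0.9}, {"goal": 0.1}]
target_words = ["the", "goal"]

plain = weighted_caption_nll(step_probs, target_words, word_weights={})
prioritized = weighted_caption_nll(step_probs, target_words, word_weights={"goal": 3.0})
print(plain, prioritized)
```

With an empty weight table the function reduces to the ordinary mean negative log-likelihood. Raising the weight of a domain word such as "goal" makes errors on that word dominate the loss, which is the general intuition behind prioritizing selected words.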

