Deep soccer captioning with transformer: dataset, semantics-related losses, and multi-level evaluation
The paper addresses captioning for soccer videos, a domain-specific task, and makes incremental improvements to model components and evaluation.
This work tackles caption generation for soccer videos. It introduces a dataset of 22k caption-clip pairs with three visual features, and a model that combines a transformer and ConvNets and is trained with semantics-related losses. Together with the additional visual features, these losses yield a 28% improvement in normalized captioning score and raise caption diversity from 0.07 to 0.18.
This work aims at generating captions for soccer videos using deep learning. To this end, the paper introduces a dataset, a model, and a triple-level evaluation. The dataset consists of 22k caption-clip pairs and three visual features (images, optical flow, inpainting) for ~500 hours of \emph{SoccerNet} videos. The model is divided into three parts: a transformer learns language, ConvNets learn vision, and a fusion of linguistic and visual features generates captions. The paper suggests evaluating generated captions at three levels: syntax (commonly used evaluation metrics such as BLEU score and CIDEr), meaning (the quality of descriptions as judged by a domain expert), and corpus (the diversity of generated captions). The paper shows that semantics-related losses, which prioritize selected words, improve the diversity of generated captions from 0.07 to 0.18. Semantics-related losses and the use of more visual features (optical flow, inpainting) improve the normalized captioning score by 28\%. The web page of this work: https://sites.google.com/view/soccercaptioning
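The abstract does not specify the form of the semantics-related losses. As one possible reading of "losses that prioritize selected words", the sketch below shows a token-level cross-entropy in which chosen target words receive a larger weight. The function name `weighted_cross_entropy`, the `priority_ids` set, and the `priority_weight` value are all assumptions made for illustration and are not taken from the paper.

```python
import math

def weighted_cross_entropy(logits, target_ids, priority_ids, priority_weight=2.0):
    """Weighted mean of per-token cross-entropy.

    logits: list of per-step score lists (one score per vocabulary entry).
    target_ids: ground-truth token id for each step.
    priority_ids: set of token ids to up-weight (hypothetical selection).
    priority_weight: weight for prioritized tokens (assumed value).
    """
    total, weight_sum = 0.0, 0.0
    for scores, target in zip(logits, target_ids):
        # Numerically stable log of the softmax normalizer.
        peak = max(scores)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        nll = log_norm - scores[target]  # negative log-likelihood of the target
        weight = priority_weight if target in priority_ids else 1.0
        total += weight * nll
        weight_sum += weight
    return total / weight_sum

# Toy example: vocabulary of 3 tokens, caption of 2 tokens.
logits = [[2.0, 0.5, 0.1], [0.2, 0.3, 1.5]]
targets = [0, 2]
plain = weighted_cross_entropy(logits, targets, priority_ids=set())
boosted = weighted_cross_entropy(logits, targets, priority_ids={2})
```

In this toy example the second token is predicted less confidently than the first, so up-weighting it (token id 2) raises the loss relative to the unweighted case. During training, that extra weight would push the model harder toward getting the prioritized word right.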