CV · AI · Mar 17, 2024

Domain-Guided Masked Autoencoders for Unique Player Identification

arXiv:2403.11328v1 · 2 citations · h-index: 48 · Proceedings of the 21st Conference on Robots and Vision
Originality: Incremental advance
AI Analysis

This work addresses a domain-specific problem in sports analytics for tasks like player assessment and broadcast production, with incremental advancements in masking techniques and network design.

The paper tackles unique player identification from broadcast sports videos, which is challenging because of motion blur, low resolution, and occlusions. It proposes a domain-guided masked autoencoder (d-MAE) and a spatio-temporal network built on it, and reports test-set accuracy gains of 8.58%, 4.29%, and 1.20% over the previous state of the art on three datasets.

Unique player identification is a fundamental module in vision-driven sports analytics. Identifying players from broadcast videos can aid with various downstream tasks such as player assessment, in-game analysis, and broadcast production. However, automatic detection of jersey numbers using deep features is challenging primarily due to: a) motion blur, b) low resolution video feed, and c) occlusions. With their recent success in various vision tasks, masked autoencoders (MAEs) have emerged as a superior alternative to conventional feature extractors. However, most MAEs simply zero-out image patches either randomly or focus on where to mask rather than how to mask. Motivated by human vision, we devise a novel domain-guided masking policy for MAEs termed d-MAE to facilitate robust feature extraction in the presence of motion blur for player identification. We further introduce a new spatio-temporal network leveraging our novel d-MAE for unique player identification. We conduct experiments on three large-scale sports datasets, including a curated baseball dataset, the SoccerNet dataset, and an in-house ice hockey dataset. We preprocess the datasets using an upgraded keyframe identification (KfID) module by focusing on frames containing jersey numbers. Additionally, we propose a keyframe-fusion technique to augment keyframes, preserving spatial and temporal context. Our spatio-temporal network showcases significant improvements, surpassing the current state-of-the-art by 8.58%, 4.29%, and 1.20% in the test set accuracies, respectively. Rigorous ablations highlight the effectiveness of our domain-guided masking approach and the refined KfID module, resulting in performance enhancements of 1.48% and 1.84% respectively, compared to original architectures.
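The abstract contrasts MAEs that zero out randomly chosen patches with a policy that changes how the masked patches are corrupted. The sketch below illustrates that distinction only; it is not the paper's d-MAE policy, which this summary does not specify. The blur-based variant is an assumption made for illustration, and every name here (`patch_mask`, `zero_mask`, `horizontal_blur`, `blur_mask`) is hypothetical. It assumes a 2-D grayscale float image whose sides are divisible by the patch size.

```python
import numpy as np


def patch_mask(height, width, patch, ratio, rng):
    """Return a boolean (height, width) array; True marks pixels in masked patches."""
    assert height % patch == 0 and width % patch == 0
    rows, cols = height // patch, width // patch
    n_patches = rows * cols
    n_masked = int(round(ratio * n_patches))
    # Pick patch indices uniformly at random, without repeats.
    chosen = rng.choice(n_patches, size=n_masked, replace=False)
    grid = np.zeros(n_patches, dtype=bool)
    grid[chosen] = True
    grid = grid.reshape(rows, cols)
    # Expand each grid cell into a patch x patch block of pixels.
    return np.repeat(np.repeat(grid, patch, axis=0), patch, axis=1)


def zero_mask(image, mask):
    """Conventional masking: set every masked pixel to zero."""
    out = image.copy()
    out[mask] = 0.0
    return out


def horizontal_blur(image, kernel=5):
    """Box-blur each row; a crude stand-in for horizontal motion blur."""
    assert kernel % 2 == 1
    pad = kernel // 2
    padded = np.pad(image, ((0, 0), (pad, pad)), mode="edge")
    # Average `kernel` horizontally shifted copies of the image.
    shifted = [padded[:, i:i + image.shape[1]] for i in range(kernel)]
    return np.mean(shifted, axis=0)


def blur_mask(image, mask, kernel=5):
    """Hypothetical alternative: replace masked pixels with blurred ones."""
    out = image.copy()
    out[mask] = horizontal_blur(image, kernel)[mask]
    return out


rng = np.random.default_rng(0)
image = rng.random((32, 32))
mask = patch_mask(32, 32, patch=8, ratio=0.75, rng=rng)
zeroed = zero_mask(image, mask)
blurred = blur_mask(image, mask)
```

Both variants share the same mask, so they differ only in what replaces the masked pixels: zeros in one case, a degraded copy of the original content in the other.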

