Single-Camera Basketball Tracker through Pose and Semantic Feature Fusion
This work addresses player tracking for sports analytics in challenging single-feed basketball videos. It presents an incremental improvement: deep learning features alone can suffice for robust tracking, without additional contextual cues such as proximity or color similarity.
The paper tackles the problem of tracking basketball players in single-camera videos with clutter and occlusions by developing a tracker that fuses pose and semantic features. Its performance is evaluated in terms of MOTA on a basketball dataset with more than 10k instances.
Tracking sports players is a challenging scenario, especially in single-feed videos recorded in tight courts, where clutter and occlusions cannot be avoided. This paper presents an analysis of several geometric and semantic visual features to detect and track basketball players. An ablation study is carried out and then used to show that a robust tracker can be built with deep learning features alone, without the need to extract contextual ones, such as proximity or color similarity, or to apply camera stabilization techniques. The presented tracker consists of: (1) a detection step, which uses a pretrained deep learning model to estimate the players' poses, followed by (2) a tracking step, which leverages pose and semantic information from the output of a convolutional layer in a VGG network. Its performance is analyzed in terms of MOTA over a basketball dataset with more than 10k instances.
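For readers unfamiliar with the metric, MOTA (Multiple Object Tracking Accuracy) is conventionally defined as one minus the ratio of accumulated errors (misses, false positives, and identity switches) to the total number of ground-truth objects over all frames. The following is a minimal sketch of that standard definition, not the authors' evaluation code; the FrameErrors type, its field names, and the example counts are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class FrameErrors:
    """Per-frame error counts (hypothetical container, for illustration only)."""
    ground_truth: int      # number of annotated players in the frame
    misses: int            # annotated players with no matched detection
    false_positives: int   # detections matched to no annotated player
    id_switches: int       # matched players whose track identity changed


def mota(frames: Iterable[FrameErrors]) -> float:
    """Standard MOTA: 1 - (misses + false positives + ID switches) / ground truth."""
    frames = list(frames)
    total_gt = sum(f.ground_truth for f in frames)
    if total_gt == 0:
        raise ValueError("MOTA is undefined when there are no ground-truth objects")
    total_errors = sum(
        f.misses + f.false_positives + f.id_switches for f in frames
    )
    return 1.0 - total_errors / total_gt


# Made-up counts for two frames with ten annotated players each.
example = [
    FrameErrors(ground_truth=10, misses=1, false_positives=1, id_switches=0),
    FrameErrors(ground_truth=10, misses=0, false_positives=1, id_switches=1),
]
score = mota(example)
```

With these made-up counts there are 4 errors against 20 ground-truth instances, so the score is about 0.8. Because errors are summed before dividing, MOTA can become negative when the tracker makes more errors than there are ground-truth objects.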