Learning Multi-Object Tracking and Segmentation from Automatic Annotations
This work addresses the scalability of training data creation for multi-object tracking and segmentation (MOTS), benefiting computer vision researchers and practitioners by reducing reliance on costly manual annotation.
The authors tackle the problem of expensive manual annotation for multi-object tracking and segmentation by developing an automatic pipeline that generates training data, together with a new deep learning architecture. This yields sMOTSA improvements of +1.9% (cars) and +7.5% (pedestrians) on KITTI MOTS, and +4.1% over the previous best methods on MOTSChallenge.
In this work we contribute a novel pipeline that automatically generates training data and improves over state-of-the-art multi-object tracking and segmentation (MOTS) methods. Our proposed track mining algorithm turns raw street-level videos into high-fidelity MOTS training data, is scalable, and removes the need for expensive and time-consuming manual annotation. We leverage state-of-the-art instance segmentation results in combination with optical flow predictions, also trained on automatically harvested training data. Our second major contribution is MOTSNet - a deep learning, tracking-by-detection architecture for MOTS - deploying a novel mask-pooling layer for improved object association over time. Training MOTSNet with our automatically extracted data leads to significantly improved sMOTSA scores on the novel KITTI MOTS dataset (+1.9%/+7.5% on cars/pedestrians), and MOTSNet improves by +4.1% over previously best methods on the MOTSChallenge dataset. Our most impressive finding is that we can improve over previous best-performing works, even in the complete absence of manually annotated MOTS training data.
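The abstract does not spell out how the mask-pooling layer works, so the following is only a minimal sketch of the general idea under a stated assumption: that the per-object embedding used for association is obtained by averaging feature vectors over the pixels of the predicted instance mask, rather than over the whole bounding box. The function names `mask_pool` and `box_pool` and the toy feature map are hypothetical and are not taken from MOTSNet.

```python
# Illustrative sketch only: the averaging rule below is an assumption,
# not the layer definition used by MOTSNet.

def mask_pool(features, mask):
    """Average the C-dimensional feature vectors at pixels where mask is 1.

    features: H x W x C nested lists of floats
    mask:     H x W nested lists of 0/1 values
    """
    channels = len(features[0][0])
    sums = [0.0] * channels
    count = 0
    for feature_row, mask_row in zip(features, mask):
        for vector, inside in zip(feature_row, mask_row):
            if inside:
                count += 1
                for c in range(channels):
                    sums[c] += vector[c]
    if count == 0:
        # Empty mask: return a zero embedding instead of dividing by zero.
        return sums
    return [s / count for s in sums]


def box_pool(features):
    """Baseline for comparison: average over every pixel of the crop."""
    all_ones = [[1] * len(row) for row in features]
    return mask_pool(features, all_ones)


# Toy 2 x 2 crop with 2 channels: object pixels carry [1, 0],
# background pixels carry [0, 1].
FEATURES = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [0.0, 1.0]],
]
MASK = [
    [1, 0],
    [1, 0],
]

object_embedding = mask_pool(FEATURES, MASK)  # uses object pixels only
crop_embedding = box_pool(FEATURES)           # mixes in background pixels
```

In this toy case the mask-pooled embedding contains only the object's features, while the box-pooled embedding is diluted by background. This illustrates one plausible reason why pooling under the mask could help association over time.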