Siamese Tracking with Lingual Object Constraints
This work is significant for applications like surveillance and target-specific video summarization, where monitoring objects based on predefined semantic conditions is crucial, representing an incremental step in tracking capabilities.
This paper addresses the problem of visual object tracking with additional lingual constraints, such as 'when standing near a yellow car'. The authors propose two deep models, SiamCT-DFG and SiamCT-CA, extending a state-of-the-art Siamese tracking method, and demonstrate that SiamCT-CA significantly outperforms its counterparts.
Classically, visual object tracking involves following a target object throughout a given video, and it provides us the motion trajectory of the object. However, for many practical applications, this output is often insufficient since additional semantic information is required to act on the video material. Example applications of this are surveillance and target-specific video summarization, where the target needs to be monitored with respect to certain predefined constraints, e.g., 'when standing near a yellow car'. This paper explores, tracking visual objects subjected to additional lingual constraints. Differently from Li et al., we impose additional lingual constraints upon tracking, which enables new applications of tracking. Whereas in their work the goal is to improve and extend upon tracking itself. To perform benchmarks and experiments, we contribute two datasets: c-MOT16 and c-LaSOT, curated through appending additional constraints to the frames of the original LaSOT and MOT16 datasets. We also experiment with two deep models SiamCT-DFG and SiamCT-CA, obtained through extending a recent state-of-the-art Siamese tracking method and adding modules inspired from the fields of natural language processing and visual question answering. Through experimental results, we show that the proposed model SiamCT-CA can significantly outperform its counterparts. Furthermore, our method enables the selective compression of videos, based on the validity of the constraint.