Multi-Object Tracking Retrieval with LLaVA-Video: A Training-Free Solution to MOT25-StAG Challenge
This work addresses the problem of localizing and tracking objects in videos based on language queries, a task of interest to computer vision researchers. The contribution is incremental, however, as it combines existing models without any new training.
The paper tackles the MOT25-StAG Challenge by modeling it as a video retrieval problem and using a two-stage, zero-shot approach that combines FastTracker and LLaVA-Video. It achieves m-HIoU and HOTA scores of 20.68 and 10.73, respectively, placing second in the challenge.
In this report, we present our solution to the MOT25-Spatiotemporal Action Grounding (MOT25-StAG) Challenge. The challenge aims to accurately localize and track multiple objects that match specific, free-form language queries, taking videos of complex real-world scenes as input. We model the task as a video retrieval problem and present a two-stage, zero-shot approach that combines the strengths of the state-of-the-art tracking model FastTracker and the multi-modal large language model LLaVA-Video. On the MOT25-StAG test set, our method achieves m-HIoU and HOTA scores of 20.68 and 10.73, respectively, placing second in the challenge.
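As a rough illustration of the two-stage structure described above, the sketch below groups tracker output into tracklets and then keeps only the tracklets that a judge accepts for a given query. It is a minimal sketch under stated assumptions, not the pipeline used in this report: the names `Tracklet`, `group_into_tracklets`, `retrieve_tracklets`, and `toy_motion_judge` are hypothetical, the box format is assumed, and the toy judge merely stands in for a query to a multi-modal model such as LLaVA-Video. No FastTracker or LLaVA-Video API is called.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Assumed box format (x1, y1, x2, y2); the report does not specify one.
Box = Tuple[float, float, float, float]


@dataclass
class Tracklet:
    """One tracked object: a track ID plus its box in each frame."""
    track_id: int
    boxes: Dict[int, Box]  # frame index -> box


def group_into_tracklets(rows: List[Tuple[int, int, Box]]) -> List[Tracklet]:
    """Stage 1 stand-in: group (frame, track_id, box) rows, as a tracker
    might emit them, into one Tracklet per track ID."""
    by_id: Dict[int, Tracklet] = {}
    for frame, track_id, box in rows:
        by_id.setdefault(track_id, Tracklet(track_id, {})).boxes[frame] = box
    return [by_id[k] for k in sorted(by_id)]


def retrieve_tracklets(
    tracklets: List[Tracklet],
    query: str,
    matches_query: Callable[[Tracklet, str], bool],
) -> List[Tracklet]:
    """Stage 2: treat grounding as retrieval and keep the tracklets that the
    judge accepts for the query. In a real system the judge would be a
    model call; here it is any callable."""
    return [t for t in tracklets if matches_query(t, query)]


def toy_motion_judge(tracklet: Tracklet, query: str) -> bool:
    """Hypothetical placeholder judge: compares horizontal displacement
    between the first and last frame against the word 'moving' in the query."""
    frames = sorted(tracklet.boxes)
    first, last = tracklet.boxes[frames[0]], tracklet.boxes[frames[-1]]
    moved = abs(last[0] - first[0]) > 1.0
    return moved if "moving" in query else not moved
```

Passing the judge as a callable keeps the retrieval stage independent of the tracker, which mirrors the training-free setup: either stage can be swapped without touching the other.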