CVSep 5, 2023Code
BEVTrack: A Simple and Strong Baseline for 3D Single Object Tracking in Bird's-Eye ViewYuxiang Yang, Yingqi Deng, Mian Pan et al.
3D Single Object Tracking (SOT) is a fundamental task in computer vision and plays a critical role in applications like autonomous driving. However, existing algorithms often involve complex designs and multiple loss functions, making model training and deployment challenging. Furthermore, their reliance on fixed probability distribution assumptions (e.g., Laplacian or Gaussian) hinders their ability to adapt to diverse target characteristics such as varying sizes and motion patterns, ultimately affecting tracking precision and robustness. To address these issues, we propose BEVTrack, a simple yet effective motion-based tracking method. BEVTrack directly estimates object motion in Bird's-Eye View (BEV) using a single regression loss. To enhance accuracy for targets with diverse attributes, it learns adaptive likelihood functions tailored to individual targets, avoiding the limitations of fixed distribution assumptions in previous methods. This approach provides valuable priors for tracking and significantly boosts performance. Comprehensive experiments on KITTI, NuScenes, and Waymo Open Dataset demonstrate that BEVTrack achieves state-of-the-art results while operating at 200 FPS, enabling real-time applicability. The code will be released at https://github.com/xmm-prio/BEVTrack.
CVAug 3, 2024Code
SiamMo: Siamese Motion-Centric 3D Object TrackingYuxiang Yang, Yingqi Deng, Jing Zhang et al.
Current 3D single object tracking methods primarily rely on the Siamese matching-based paradigm, which struggles with textureless and incomplete LiDAR point clouds. Conversely, the motion-centric paradigm avoids appearance matching, thus overcoming these issues. However, its complex multi-stage pipeline and the limited temporal modeling capability of a single-stream architecture constrain its potential. In this paper, we introduce SiamMo, a novel and simple Siamese motion-centric tracking approach. Unlike the traditional single-stream architecture, we employ Siamese feature extraction for motion-centric tracking. This decouples feature extraction from temporal fusion, significantly enhancing tracking performance. Additionally, we design a Spatio-Temporal Feature Aggregation module to integrate Siamese features at multiple scales, capturing motion information effectively. We also introduce a Box-aware Feature Encoding module to encode object size priors into motion estimation. SiamMo is a purely motion-centric tracker that eliminates the need for additional processes like segmentation and box refinement. Without whistles and bells, SiamMo not only surpasses state-of-the-art methods across multiple benchmarks but also demonstrates exceptional robustness in challenging scenarios. SiamMo sets a new record on the KITTI tracking benchmark with 90.1\% precision while maintaining a high inference speed of 108 FPS. The code will be released at https://github.com/HDU-VRLab/SiamMo.
CVDec 25, 2023Code
APTv2: Benchmarking Animal Pose Estimation and Tracking with a Large-scale Dataset and BeyondYuxiang Yang, Yingqi Deng, Yufei Xu et al.
Animal Pose Estimation and Tracking (APT) is a critical task in detecting and monitoring the keypoints of animals across a series of video frames, which is essential for understanding animal behavior. Past works relating to animals have primarily focused on either animal tracking or single-frame animal pose estimation only, neglecting the integration of both aspects. The absence of comprehensive APT datasets inhibits the progression and evaluation of animal pose estimation and tracking methods based on videos, thereby constraining their real-world applications. To fill this gap, we introduce APTv2, the pioneering large-scale benchmark for animal pose estimation and tracking. APTv2 comprises 2,749 video clips filtered and collected from 30 distinct animal species. Each video clip includes 15 frames, culminating in a total of 41,235 frames. Following meticulous manual annotation and stringent verification, we provide high-quality keypoint and tracking annotations for a total of 84,611 animal instances, split into easy and hard subsets based on the number of instances that exists in the frame. With APTv2 as the foundation, we establish a simple baseline method named \posetrackmethodname and provide benchmarks for representative models across three tracks: (1) single-frame animal pose estimation track to evaluate both intra- and inter-domain transfer learning performance, (2) low-data transfer and generalization track to evaluate the inter-species domain generalization performance, and (3) animal pose tracking track. Our experimental results deliver key empirical insights, demonstrating that APTv2 serves as a valuable benchmark for animal pose estimation and tracking. It also presents new challenges and opportunities for future research. The code and dataset are released at \href{https://github.com/ViTAE-Transformer/APTv2}{https://github.com/ViTAE-Transformer/APTv2}.