CV | Nov 14, 2025

SOTFormer: A Minimal Transformer for Unified Object Tracking and Trajectory Prediction

arXiv:2511.11824v1 | h-index: 2
Originality: Incremental advance
AI Analysis

This work addresses real-time perception for applications such as autonomous driving and surveillance. It appears incremental: it builds on prior transformer-based tracking models and adds a ground-truth-primed memory, a burn-in anchor loss, and a single temporal-attention layer.

The paper tackles the challenge of accurate single-object tracking and short-term motion forecasting under occlusion and scale variation by introducing SOTFormer, a minimal transformer that unifies detection, tracking, and trajectory prediction, achieving 76.3 AUC and 53.7 FPS on the Mini-LaSOT benchmark.
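As a rough illustration of what an AUC score such as 76.3 measures, the sketch below computes a success-plot AUC the way single-object tracking benchmarks commonly do: the fraction of frames whose IoU exceeds a threshold, averaged over evenly spaced thresholds in [0, 1]. This is an assumption about the metric, not the paper's evaluation code; the function names, the `(x, y, w, h)` box layout, the 21 thresholds, and the strict `>` comparison are all illustrative choices.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    ax1, ay1, aw, ah = box_a
    bx1, by1, bw, bh = box_b
    ax2, ay2 = ax1 + aw, ay1 + ah
    bx2, by2 = bx1 + bw, by1 + bh
    # Width and height of the overlap rectangle, clamped at zero.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def success_auc(predicted, ground_truth, num_thresholds=21):
    """Mean success rate over evenly spaced IoU thresholds in [0, 1].

    Assumed convention: a frame counts as a success at threshold t
    when its IoU is strictly greater than t.
    """
    overlaps = [iou(p, g) for p, g in zip(predicted, ground_truth)]
    thresholds = [i / (num_thresholds - 1) for i in range(num_thresholds)]
    rates = [
        sum(1 for v in overlaps if v > t) / len(overlaps)
        for t in thresholds
    ]
    return sum(rates) / len(rates)


if __name__ == "__main__":
    gt = [(0, 0, 10, 10), (5, 5, 10, 10)]
    pred = [(1, 1, 10, 10), (5, 5, 10, 10)]
    print(f"success AUC: {100 * success_auc(pred, gt):.1f}")
```

Under this convention a perfect tracker scores slightly below 100, because no frame can exceed the final threshold of 1.0.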

Accurate single-object tracking and short-term motion forecasting remain challenging under occlusion, scale variation, and temporal drift, which disrupt the temporal coherence required for real-time perception. We introduce SOTFormer, a minimal constant-memory temporal transformer that unifies object detection, tracking, and short-horizon trajectory prediction within a single end-to-end framework. Unlike prior models with recurrent or stacked temporal encoders, SOTFormer achieves stable identity propagation through a ground-truth-primed memory and a burn-in anchor loss that explicitly stabilizes initialization. A single lightweight temporal-attention layer refines embeddings across frames, enabling real-time inference with fixed GPU memory. On the Mini-LaSOT (20%) benchmark, SOTFormer attains 76.3 AUC and 53.7 FPS (AMP, 4.3 GB VRAM), outperforming transformer baselines such as TrackFormer and MOTRv2 under fast motion, scale change, and occlusion.
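The idea of a constant-memory temporal-attention layer can be sketched in a few lines. The toy below is not SOTFormer's implementation: the class name, the FIFO eviction policy, the permanently retained anchor slot, the residual update, and the choice to store refined embeddings are all assumptions made for illustration. It shows only how a memory of fixed size, primed with an embedding derived from the ground-truth first frame, lets a single attention step refine each new frame's embedding without memory growing over time.

```python
import math
from collections import deque


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class FixedMemoryAttention:
    """Hypothetical single-layer temporal attention over a bounded memory."""

    def __init__(self, capacity, primed_embedding):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        # Assumption: the slot primed from the ground-truth first frame
        # is kept for the whole sequence as an identity anchor.
        self.anchor = list(primed_embedding)
        # Remaining slots hold the most recent embeddings (FIFO eviction).
        self.recent = deque(maxlen=capacity - 1)

    def size(self):
        return 1 + len(self.recent)

    def refine(self, query):
        """Attend from the current embedding to memory, then store the result."""
        memory = [self.anchor] + list(self.recent)
        scale = math.sqrt(len(query))
        # Scaled dot-product scores between the query and each memory slot.
        scores = [
            sum(q * k for q, k in zip(query, slot)) / scale
            for slot in memory
        ]
        weights = softmax(scores)
        # Attention-weighted average of the memory slots.
        context = [
            sum(w * slot[i] for w, slot in zip(weights, memory))
            for i in range(len(query))
        ]
        # Residual connection: refined embedding = query + context.
        refined = [q + c for q, c in zip(query, context)]
        self.recent.append(refined)
        return refined, weights


if __name__ == "__main__":
    layer = FixedMemoryAttention(capacity=3, primed_embedding=[1.0, 0.0])
    for frame in range(5):
        refined, weights = layer.refine([0.5, 0.5])
        print(frame, layer.size(), [round(w, 3) for w in weights])
```

Because the memory never exceeds `capacity` slots, the per-frame cost stays constant regardless of sequence length, which is the property the abstract attributes to fixed GPU memory use.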

