CV, Jan 5

AR-MOT: Autoregressive Multi-object Tracking

arXiv:2601.01925v1, 1 citation, h-index: 9
Originality: Highly original
AI Analysis

AR-MOT addresses the limited adaptability of existing multi-object tracking methods, giving researchers and practitioners a more generalizable approach. Its contribution is incremental: it improves flexibility rather than raw performance.

The paper tackles the inflexibility of existing multi-object tracking methods by proposing AR-MOT, an autoregressive paradigm that formulates tracking as a sequence generation task using a large language model, achieving performance comparable to state-of-the-art methods on MOT17 and DanceTrack benchmarks.
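
To make the "tracking as sequence generation" idea concrete, here is a minimal sketch of how per-frame tracking results could be written to and read back from a flat text sequence. The `<obj>` markup, the `TrackedBox` type, and both function names are hypothetical choices made for illustration; the paper's actual output format is not given in this summary.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TrackedBox:
    """One tracked object in a frame (hypothetical container, not from the paper)."""
    track_id: int
    x1: int
    y1: int
    x2: int
    y2: int


# Assumed textual format: "<obj> id=ID box=x1,y1,x2,y2 </obj>", one group per object.
_OBJ_PATTERN = re.compile(r"<obj> id=(\d+) box=(\d+),(\d+),(\d+),(\d+) </obj>")


def serialize_frame(boxes: List[TrackedBox]) -> str:
    """Turn a frame's boxes into a single target sequence, ordered by track id."""
    ordered = sorted(boxes, key=lambda b: b.track_id)
    return " ".join(
        f"<obj> id={b.track_id} box={b.x1},{b.y1},{b.x2},{b.y2} </obj>"
        for b in ordered
    )


def parse_frame(text: str) -> List[TrackedBox]:
    """Recover structured boxes from generated text, skipping malformed spans."""
    return [TrackedBox(*map(int, m.groups())) for m in _OBJ_PATTERN.finditer(text)]


frame = [TrackedBox(7, 10, 20, 110, 220), TrackedBox(3, 5, 5, 50, 60)]
sequence = serialize_frame(frame)
restored = parse_frame(sequence)
```

Under this assumed format, adding a new output field (for example a class label or a caption) would only change the string template and the regular expression, which mirrors the extensibility argument in the summary: the output format changes, not the model's heads.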

As multi-object tracking (MOT) tasks continue to evolve toward more general and multi-modal scenarios, the rigid and task-specific architectures of existing MOT methods increasingly hinder their applicability across diverse tasks and limit flexibility in adapting to new tracking formulations. Most approaches rely on fixed output heads and bespoke tracking pipelines, making them difficult to extend to more complex or instruction-driven tasks. To address these limitations, we propose AR-MOT, a novel autoregressive paradigm that formulates MOT as a sequence generation task within a large language model (LLM) framework. This design enables the model to output structured results through flexible sequence construction, without requiring any task-specific heads. To enhance region-level visual perception, we introduce an Object Tokenizer based on a pretrained detector. To mitigate the misalignment between global and regional features, we propose a Region-Aware Alignment (RAA) module, and to support long-term tracking, we design a Temporal Memory Fusion (TMF) module that caches historical object tokens. AR-MOT offers strong potential for extensibility, as new modalities or instructions can be integrated by simply modifying the output sequence format without altering the model architecture. Extensive experiments on MOT17 and DanceTrack validate the feasibility of our approach, achieving performance comparable to state-of-the-art methods while laying the foundation for more general and flexible MOT systems.
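
The abstract says only that the Temporal Memory Fusion (TMF) module caches historical object tokens. As a rough illustration of what such a cache might look like, the sketch below keeps one token vector per track and blends each new token into it with an exponential moving average. The class name `TokenMemory`, the `momentum` and `max_age` parameters, and the moving-average fusion rule are all assumptions made for this example; they are not the paper's method.

```python
from typing import Dict, List


class TokenMemory:
    """Per-track cache of object tokens (hypothetical stand-in for a memory module)."""

    def __init__(self, momentum: float = 0.9, max_age: int = 30) -> None:
        self.momentum = momentum      # weight given to the cached history
        self.max_age = max_age        # frames a track may go unseen before removal
        self.tokens: Dict[int, List[float]] = {}
        self.last_seen: Dict[int, int] = {}

    def update(self, frame_idx: int, track_id: int, token: List[float]) -> List[float]:
        """Fuse a new token into the cached one and return the fused token."""
        cached = self.tokens.get(track_id)
        if cached is None:
            fused = list(token)       # first observation: store as-is
        else:
            m = self.momentum
            # Assumed fusion rule: exponential moving average, element by element.
            fused = [m * old + (1.0 - m) * new for old, new in zip(cached, token)]
        self.tokens[track_id] = fused
        self.last_seen[track_id] = frame_idx
        return fused

    def prune(self, frame_idx: int) -> List[int]:
        """Drop tracks unseen for more than max_age frames; return their ids."""
        stale = [tid for tid, seen in self.last_seen.items()
                 if frame_idx - seen > self.max_age]
        for tid in stale:
            del self.tokens[tid]
            del self.last_seen[tid]
        return stale


memory = TokenMemory(momentum=0.5, max_age=2)
first = memory.update(frame_idx=0, track_id=1, token=[2.0, 4.0])
second = memory.update(frame_idx=1, track_id=1, token=[4.0, 0.0])
```

A cache of this kind lets an identity survive short occlusions, since the stored token stays available for matching until `prune` removes it. A learned fusion, such as attention over the cached tokens, would be a natural alternative to the fixed average used here.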
