CV · Mar 8

AR2-4FV: Anchored Referring and Re-identification for Long-Term Grounding in Fixed-View Videos

arXiv:2603.07758v1
Predicted impact: top 50% in CV (last 90 days) · Originality: highly original
AI Analysis

This work offers a strong, specific gain for object re-identification and long-term tracking in fixed-view settings such as video surveillance and human-computer interaction.

This paper addresses the challenge of long-term language-guided referring in fixed-view videos where objects may be occluded or leave the scene. The proposed AR2-4FV system improves Re-Capture Rate by 10.3% and reduces Re-Capture Latency by 24.2% compared to the best baseline.

Long-term language-guided referring in fixed-view videos is challenging: the referent may be occluded or leave the scene for long intervals and later re-enter, while framewise referring pipelines drift as re-identification (ReID) becomes unreliable. AR2-4FV leverages background stability for long-term referring. An offline Anchor Bank is distilled from static background structures; at inference, the text query is aligned with this bank to produce an Anchor Map that serves as persistent semantic memory when the referent is absent. An anchor-based re-entry prior accelerates re-capture upon return, and a lightweight ReID-Gating mechanism maintains identity continuity using displacement cues in the anchor frame. The system predicts per-frame bounding boxes without assuming the target is visible in the first frame or explicitly modeling appearance variations. AR2-4FV improves Re-Capture Rate (RCR) by 10.3% and reduces Re-Capture Latency (RCL) by 24.2% relative to the best baseline, and ablation studies further confirm the benefits of the Anchor Map, re-entry prior, and ReID-Gating.
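
The abstract reports Re-Capture Rate and Re-Capture Latency without defining them here. As a rough illustration only, the sketch below computes both from per-frame flags under assumed definitions that may differ from the paper's: a re-entry event is a frame where the target becomes visible after an absent frame, the event counts as re-captured if a correct box appears within `window` frames while the target stays visible, and latency is the frame offset of that first correct box. The function name, its arguments, and the `window` default are hypothetical.

```python
from typing import List, Optional, Tuple


def recapture_stats(
    gt_visible: List[bool],
    hit: List[bool],
    window: int = 30,
) -> Tuple[float, Optional[float]]:
    """Return (re-capture rate, mean re-capture latency in frames).

    ASSUMED definitions, not taken from the paper:
    - gt_visible[t]: ground truth says the target is in view at frame t.
    - hit[t]: the predicted box at frame t matches the target.
    - An event starts at any frame t >= 1 that is visible after an
      absent frame (a first appearance after initial absence counts too).
    """
    if len(gt_visible) != len(hit):
        raise ValueError("gt_visible and hit must have the same length")

    events = 0
    latencies: List[int] = []
    for t in range(1, len(gt_visible)):
        if not (gt_visible[t] and not gt_visible[t - 1]):
            continue  # not a re-entry frame
        events += 1
        # Scan forward until the window ends or the target leaves again.
        for dt in range(window):
            k = t + dt
            if k >= len(gt_visible) or not gt_visible[k]:
                break
            if hit[k]:
                latencies.append(dt)
                break

    if events == 0:
        return 0.0, None
    rate = len(latencies) / events
    mean_latency = sum(latencies) / len(latencies) if latencies else None
    return rate, mean_latency


if __name__ == "__main__":
    T, F = True, False
    visible = [T, T, F, F, T, T, T, F, T, T]
    correct = [T, T, F, F, F, T, T, F, F, F]
    print(recapture_stats(visible, correct))
```

In this toy sequence there are two re-entries: the first is re-captured one frame after the target returns, and the second is never re-captured, so the rate is one half and the mean latency is one frame.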

