ARGOS: Who, Where, and When in Agentic Multi-Camera Person Search

Myungchul Kim, Kwanyong Park, Junmo Kim, In So Kweon

arXiv:2604.12762

AI Analysis

For researchers in multi-camera person search and embodied AI, ARGOS provides a challenging benchmark that exposes the limitations of current LLMs in interactive spatio-temporal reasoning.

ARGOS reformulates multi-camera person search as an interactive reasoning problem in which an agent must plan and query under information asymmetry. The benchmark shows that current LLMs are far from solving it: the best TWS scores are 0.383 on the spatial track and 0.590 on the temporal track, and removing domain-specific tools drops accuracy by up to 49.6 percentage points.

We introduce ARGOS, the first benchmark and framework that reformulates multi-camera person search as an interactive reasoning problem requiring an agent to plan, question, and eliminate candidates under information asymmetry. An ARGOS agent receives a vague witness statement and must decide what to ask, when to invoke spatial or temporal tools, and how to interpret ambiguous responses, all within a limited turn budget. Reasoning is grounded in a Spatio-Temporal Topology Graph (STTG) encoding camera connectivity and empirically validated transition times. The benchmark comprises 2,691 tasks across 14 real-world scenarios in three progressive tracks: semantic perception (Who), spatial reasoning (Where), and temporal reasoning (When). Experiments with four LLM backbones show the benchmark is far from solved (best TWS: 0.383 on Track 2, 0.590 on Track 3), and ablations confirm that removing domain-specific tools drops accuracy by up to 49.6 percentage points.
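To make the role of the Spatio-Temporal Topology Graph concrete, the following is a minimal sketch of how camera connectivity and transition times could be used to eliminate candidates. It is not the authors' implementation: the class name `STTG`, the helper `eliminate`, the camera labels, the candidate IDs, and every time value are hypothetical, and the sketch assumes each directed camera pair stores a single (min, max) transition window in seconds.

```python
from dataclasses import dataclass, field


@dataclass
class STTG:
    """Hypothetical topology graph: directed camera pairs with time windows."""
    # edges[(src, dst)] = (min_seconds, max_seconds); values are illustrative
    edges: dict = field(default_factory=dict)

    def add_transition(self, src, dst, min_s, max_s):
        self.edges[(src, dst)] = (min_s, max_s)

    def neighbors(self, cam):
        """Cameras directly reachable from `cam`, in sorted order."""
        return sorted(dst for (src, dst) in self.edges if src == cam)

    def feasible(self, src, dst, elapsed_s):
        """True if a direct src -> dst move fits the stored time window."""
        window = self.edges.get((src, dst))
        if window is None:
            return False  # no direct connection recorded
        lo, hi = window
        return lo <= elapsed_s <= hi


def eliminate(graph, last_cam, last_time, sightings):
    """Keep candidates whose sighting is reachable from the last known position.

    `sightings` maps a candidate ID to a (camera, timestamp_seconds) pair.
    """
    return [
        cand
        for cand, (cam, t) in sightings.items()
        if graph.feasible(last_cam, cam, t - last_time)
    ]


# Assumed toy scenario: target last seen at cam_A at t = 100 s.
graph = STTG()
graph.add_transition("cam_A", "cam_B", 20, 60)
graph.add_transition("cam_A", "cam_C", 90, 180)

sightings = {
    "p1": ("cam_B", 140),  # 40 s after the last sighting
    "p2": ("cam_C", 140),  # 40 s is too fast for cam_A -> cam_C
    "p3": ("cam_B", 300),  # 200 s is too slow for cam_A -> cam_B
}
remaining = eliminate(graph, "cam_A", 100, sightings)
```

In this toy scenario only `p1` survives, since its sighting falls inside the assumed cam_A to cam_B window. An agent working under a turn budget could apply a check of this kind after each spatial or temporal tool call to shrink the candidate set before asking the next question.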
