Interacted Object Grounding in Spatio-Temporal Human-Object Interactions
This work addresses the limitation of current vision systems in localizing diverse objects for spatio-temporal human-object interaction understanding, though it is incremental in that it builds on existing grounding methods and adds a new benchmark.
The paper tackles the problem of detecting diverse and rare interacted objects in videos by introducing a new open-world benchmark (GIO) with 1,098 object classes and 290K interacted object box annotations, and proposes a 4D question-answering framework (4D-QA) that significantly outperforms baselines in the reported experiments.
Spatio-temporal Human-Object Interaction (ST-HOI) understanding aims at detecting HOIs from videos, which is crucial for activity understanding. However, existing whole-body-object interaction video benchmarks overlook the fact that open-world objects are diverse: they usually provide only a limited set of predefined object classes. Therefore, we introduce a new open-world benchmark, Grounding Interacted Objects (GIO), which includes 1,098 interacted object classes and 290K interacted object box annotations. Accordingly, we propose an object grounding task that expects vision systems to discover interacted objects. Although today's detectors and grounding methods have achieved great success, they perform unsatisfactorily in localizing the diverse and rare objects in GIO. This reveals the limitations of current vision systems and poses a great challenge. Thus, we explore leveraging spatio-temporal cues to address object grounding and propose a 4D question-answering framework (4D-QA) to discover interacted objects from diverse videos. Our method demonstrates significant superiority over current baselines in extensive experiments. Data and code will be publicly available at https://github.com/DirtyHarryLYL/HAKE-AVA.