CV · AI · Apr 12

Rethinking Video Human-Object Interaction: Set Prediction over Time for Unified Detection and Anticipation

arXiv:2604.10397 · 18.81 citations · h-index: 20
AI Analysis

For researchers in video understanding and human-object interaction, this work provides a unified framework and corrected benchmark that improves joint detection and anticipation, though the novelty is incremental.

This paper addresses the limitations of treating video-based human-object interaction (HOI) anticipation as a task separate from detection. It makes two contributions: DETAnt-HOI, a temporally corrected benchmark, and HOI-DA, a framework that jointly performs detection and anticipation by modeling future interactions as residual transitions from current pair states. The method achieves consistent improvements in both detection and anticipation, with larger gains at longer horizons.
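The idea of "residual transitions" can be illustrated with a minimal sketch: a future interaction score is the current pair-level score plus a horizon-specific change, so a zero residual means "the interaction stays as it is". This is only an illustration under stated assumptions, not the HOI-DA implementation. The class names, logit values, horizons, and the `anticipate` helper are all hypothetical, and in a real model the residuals would be predicted by a network rather than written by hand.

```python
import math


def sigmoid(x):
    """Map a logit to a probability (multi-label setting: one per class)."""
    return 1.0 / (1.0 + math.exp(-x))


def anticipate(current_logits, residuals_by_horizon):
    """Future logits = current pair-state logits + horizon-specific residual.

    Hypothetical helper: `residuals_by_horizon` maps a horizon (in seconds)
    to one residual value per interaction class.
    """
    future = {}
    for horizon, delta in residuals_by_horizon.items():
        future[horizon] = [c + d for c, d in zip(current_logits, delta)]
    return future


# Hypothetical interaction classes for one human-object pair (person, cup).
classes = ["hold", "drink_from", "put_down"]

# Hypothetical present-time detection logits for that pair.
current_logits = [2.0, -1.0, -3.0]

# Hypothetical residuals: small change at 1 s, larger change at 3 s.
residuals = {
    1: [0.0, 0.5, 0.0],
    3: [-0.5, 2.0, 0.5],
}

future_logits = anticipate(current_logits, residuals)
future_probs = {
    h: [sigmoid(v) for v in logits] for h, logits in future_logits.items()
}

for h in sorted(future_probs):
    scores = ", ".join(
        f"{name}={p:.2f}" for name, p in zip(classes, future_probs[h])
    )
    print(f"horizon {h}s: {scores}")
```

One reason such a formulation may suit joint learning is that detection and anticipation share the same pair state: the anticipation branch only has to explain how the interaction changes, not re-derive what it currently is.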

Video-based human-object interaction (HOI) understanding requires both detecting ongoing interactions and anticipating their future evolution. However, existing methods usually treat anticipation as a downstream forecasting task built on externally constructed human-object pairs, limiting joint reasoning between detection and prediction. In addition, sparse keyframe annotations in current benchmarks can leave nominal future labels temporally misaligned with actual future dynamics, reducing the reliability of anticipation evaluation. To address these issues, we introduce DETAnt-HOI, a temporally corrected benchmark derived from VidHOI and Action Genome for more faithful multi-horizon evaluation, and HOI-DA, a pair-centric framework that jointly performs subject-object localization, present HOI detection, and future anticipation by modeling future interactions as residual transitions from current pair states. Experiments show consistent improvements in both detection and anticipation, with larger gains at longer horizons. Our results highlight that anticipation is most effective when learned jointly with detection as a structural constraint on pair-level video representation learning. Benchmark and code will be publicly available.
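The misalignment problem with sparse keyframes can be made concrete with a small sketch. If the "future" label for a 1-second horizon is simply taken from the next annotated keyframe, the real time gap depends on how the keyframes happen to be spaced. The code below shows that gap and one possible tolerance-based check. It is an illustration under assumptions, not the DETAnt-HOI construction procedure: the timestamps, the tolerance, and both helper functions are hypothetical.

```python
from bisect import bisect_left


def next_keyframe_offset(keyframe_times, index, steps_ahead=1):
    """Actual time gap (seconds) to the keyframe `steps_ahead` positions later."""
    return keyframe_times[index + steps_ahead] - keyframe_times[index]


def aligned_future_index(keyframe_times, index, horizon_s, tolerance_s):
    """Index of the later keyframe closest to t + horizon_s.

    Returns None when no later keyframe lies within `tolerance_s` of the
    target time, i.e. when no faithful label exists for this horizon.
    `keyframe_times` must be sorted in ascending order.
    """
    target = keyframe_times[index] + horizon_s
    pos = bisect_left(keyframe_times, target)
    # Only keyframes strictly after the current one are valid candidates.
    candidates = [i for i in (pos - 1, pos) if index < i < len(keyframe_times)]
    if not candidates:
        return None
    best = min(candidates, key=lambda i: abs(keyframe_times[i] - target))
    if abs(keyframe_times[best] - target) <= tolerance_s:
        return best
    return None


# Hypothetical, unevenly spaced keyframe timestamps in seconds.
times = [0.0, 1.0, 4.0, 5.0]

# Nominal "1 s ahead" label taken from the next keyframe after t = 1.0 s
# actually lies 3.0 s in the future.
gap = next_keyframe_offset(times, index=1)

# With a 0.25 s tolerance, t = 1.0 s has no valid 1 s label but does have
# a valid 3 s label (the keyframe at t = 4.0 s).
label_1s = aligned_future_index(times, 1, horizon_s=1.0, tolerance_s=0.25)
label_3s = aligned_future_index(times, 1, horizon_s=3.0, tolerance_s=0.25)
```

In this toy example a naive next-keyframe rule would score a 3-second prediction as if it were a 1-second one, which is the kind of evaluation error a temporally corrected benchmark is meant to avoid.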

