ContextHOI: Spatial Context Learning for Human-Object Interaction Detection
ContextHOI addresses a domain-specific problem in computer vision: HOI detection in ambiguous cases such as occlusion, where it offers incremental accuracy gains over existing transformer-based methods.
The paper tackles insufficient exploration of spatial context in Human-Object Interaction (HOI) detection, especially for occluded or blurred instances. It proposes ContextHOI, a dual-branch framework that captures both object detection features and spatial contexts, and reports state-of-the-art performance on the HICO-DET and V-COCO benchmarks.
Spatial contexts, such as backgrounds and surroundings, are considered critical in Human-Object Interaction (HOI) recognition, especially when the instance-centric foreground is blurred or occluded. Recent HOI detectors are usually built upon detection transformer pipelines. While such an object-detection-oriented paradigm shows promise in localizing objects, its exploration of spatial context is often insufficient for accurately recognizing human actions. To enhance the capabilities of object detectors for HOI detection, we present a dual-branch framework named ContextHOI, which efficiently captures both object detection features and spatial contexts. In the context branch, we train the model to extract informative spatial context without requiring additional hand-crafted background labels. Furthermore, we introduce context-aware spatial and semantic supervision to the context branch to filter out irrelevant noise and capture informative contexts. ContextHOI achieves state-of-the-art performance on the HICO-DET and V-COCO benchmarks. For further validation, we construct a novel benchmark, HICO-ambiguous, a subset of HICO-DET that contains images with occluded or impaired instance cues. Extensive experiments across all benchmarks, complemented by visualizations, underscore the enhancements provided by ContextHOI, especially in recognizing interactions involving occluded or blurred instances.
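To make the dual-branch idea concrete, the toy sketch below fuses per-verb scores from an instance branch and a context branch. It is an illustration only: the abstract does not specify how the two branches are combined, so the weighted late fusion, the function name `fuse_branches`, the `context_weight` parameter, the verb list, and all numeric values are assumptions rather than the authors' method. It shows how context evidence might decide the prediction when instance cues are nearly uninformative, as with an occluded object.

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def fuse_branches(instance_logits, context_logits, context_weight=0.5):
    """Hypothetical late fusion (an assumption, not the paper's design):
    a weighted sum of per-verb logits from the two branches, followed by
    a sigmoid to obtain independent per-verb scores."""
    if len(instance_logits) != len(context_logits):
        raise ValueError("both branches must score the same set of verbs")
    w = context_weight
    return [
        sigmoid((1.0 - w) * inst + w * ctx)
        for inst, ctx in zip(instance_logits, context_logits)
    ]


# Made-up example: the object is occluded, so the instance branch is
# almost flat, while the context branch (e.g. a road scene) favors "ride".
verbs = ["ride", "hold", "sit_on"]
instance_logits = [0.1, 0.0, 0.1]
context_logits = [2.0, -1.0, -0.5]

scores = fuse_branches(instance_logits, context_logits)
best = verbs[max(range(len(scores)), key=scores.__getitem__)]
print(best, [round(s, 3) for s in scores])
```

In this toy setting the instance logits barely separate the verbs, so the fused ranking is driven by the context logits. A real detector would learn both branches and their combination end to end.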