CV | Apr 2, 2022

What to look at and where: Semantic and Spatial Refined Transformer for detecting human-object interactions

arXiv:2204.00746v2 | 60 citations | h-index: 22
AI Analysis

This work addresses the problem of accurately detecting interactions between humans and objects in images for computer vision applications, representing an incremental improvement over existing Transformer-based methods.

The authors tackled human-object interaction detection by proposing a semantic and spatial refined transformer (SSRT) that introduces modules to select relevant object-action pairs and refine query representations, achieving state-of-the-art results on V-COCO and HICO-DET benchmarks.

We propose a novel one-stage Transformer-based model, the Semantic and Spatial Refined Transformer (SSRT), to solve the Human-Object Interaction (HOI) detection task, which requires localizing humans and objects and predicting their interactions. Unlike previous Transformer-based HOI approaches, which mostly focus on improving the design of the decoder outputs for the final detection, SSRT introduces two new modules that help select the most relevant object-action pairs within an image and refine the queries' representation using rich semantic and spatial features. These enhancements lead to state-of-the-art results on the two most popular HOI benchmarks: V-COCO and HICO-DET.

