CV · Oct 26, 2024

Human-Object Interaction Detection Collaborated with Large Relation-driven Diffusion Models

arXiv:2410.20155v1 · 24 citations · h-index: 22 · NIPS
Originality: Incremental advance
AI Analysis

This work addresses the challenge of accurately detecting interactions between humans and objects in images, which is crucial for applications in robotics and surveillance, and represents an incremental improvement by adapting diffusion models to this domain.

The paper tackles the problem of human-object interaction detection by leveraging text-to-image diffusion models to capture mid/low-level visual cues and compositional reasoning, achieving state-of-the-art performance on three datasets in regular and zero-shot setups.

Prevalent human-object interaction (HOI) detection approaches typically leverage large-scale visual-linguistic models to help recognize events involving humans and objects. Though promising, models trained via contrastive learning on text-image pairs often neglect mid/low-level visual cues and struggle with compositional reasoning. In response, we introduce DIFFUSIONHOI, a new HOI detector shedding light on text-to-image diffusion models. Unlike the aforementioned models, diffusion models excel in discerning mid/low-level visual concepts as generative models, and possess strong compositionality to handle novel concepts expressed in text inputs. Considering that diffusion models usually emphasize instance objects, we first devise an inversion-based strategy to learn the expression of relation patterns between humans and objects in embedding space. These learned relation embeddings then serve as textual prompts, to steer diffusion models to generate images that depict specific interactions, and to extract HOI-relevant cues from images without heavy fine-tuning. Benefiting from the above, DIFFUSIONHOI achieves SOTA performance on three datasets under both regular and zero-shot setups.
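
The "inversion-based strategy" mentioned in the abstract can be pictured, very loosely, as optimizing an input embedding while the model itself stays frozen. The toy sketch below is an illustration of that general idea only, not the paper's method: the frozen "model" is a hypothetical fixed linear map, the loss is plain squared error rather than a diffusion objective, and all names (`frozen_model`, `invert_embedding`, `WEIGHTS`, `TARGET`) are assumptions made for this example.

```python
import random


def frozen_model(embedding, weights):
    """Stand-in for a frozen network: a fixed linear map that is never updated."""
    return [sum(w * e for w, e in zip(row, embedding)) for row in weights]


def loss(embedding, weights, target):
    """Squared error between the frozen model's output and the target."""
    output = frozen_model(embedding, weights)
    return sum((o - t) ** 2 for o, t in zip(output, target))


def invert_embedding(weights, target, steps=2000, lr=0.05, seed=0):
    """Run gradient descent on the embedding only; the weights stay fixed."""
    rng = random.Random(seed)
    dim = len(weights[0])
    embedding = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
    for _ in range(steps):
        output = frozen_model(embedding, weights)
        residual = [o - t for o, t in zip(output, target)]
        # Gradient of ||W e - t||^2 with respect to e is 2 * W^T (W e - t).
        grad = [
            2 * sum(weights[i][j] * residual[i] for i in range(len(weights)))
            for j in range(dim)
        ]
        embedding = [e - lr * g for e, g in zip(embedding, grad)]
    return embedding


# Hypothetical toy data: a 2x2 frozen map and a target output to reproduce.
WEIGHTS = [[1.0, 0.5], [0.0, 1.0]]
TARGET = [2.0, 1.0]

learned = invert_embedding(WEIGHTS, TARGET)
print("learned embedding:", learned)
print("final loss:", loss(learned, WEIGHTS, TARGET))
```

In this sketch only the embedding changes during optimization, which mirrors the abstract's point that relation embeddings are learned "without heavy fine-tuning" of the underlying model. How DIFFUSIONHOI actually defines its objective and conditioning is not described in this summary.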

