CV · Mar 23, 2025

An Image-like Diffusion Method for Human-Object Interaction Detection

arXiv:2503.18134v1 · 2 citations · h-index: 9 · CVPR
Originality: Incremental advance
AI Analysis

This work addresses human-object interaction detection, a domain-specific problem in computer vision, and offers a diffusion-based approach aimed at improving detection under occlusion and in cluttered scenes.

The paper tackles human-object interaction (HOI) detection, which is ambiguous because the same interaction varies in appearance across human-object pairs and is often affected by occlusions. It proposes HOI-IDiff, a framework that uses an image-like diffusion process to generate detection outputs as images, and reports state-of-the-art results on benchmark datasets.

Human-object interaction (HOI) detection often faces high levels of ambiguity and indeterminacy, as the same interaction can appear vastly different across different human-object pairs. Additionally, the indeterminacy can be further exacerbated by issues such as occlusions and cluttered backgrounds. To handle such a challenging task, in this work, we begin with a key observation: the output of HOI detection for each human-object pair can be recast as an image. Thus, inspired by the strong image generation capabilities of image diffusion models, we propose a new framework, HOI-IDiff. In HOI-IDiff, we tackle HOI detection from a novel perspective, using an Image-like Diffusion process to generate HOI detection outputs as images. Furthermore, recognizing that our recast images differ in certain properties from natural images, we enhance our framework with a customized HOI diffusion process and a slice patchification model architecture, which are specifically tailored to generate our recast "HOI images". Extensive experiments demonstrate the efficacy of our framework.
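
The abstract does not spell out how a detection output becomes an image, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes a hypothetical layout (one padded row per output field: human box, object box, object class scores, interaction scores) and applies the standard DDPM forward noising step rather than the paper's customized HOI diffusion process. The function names `recast_as_image` and `forward_noise` are made up for this example.

```python
import math
import random


def recast_as_image(human_box, object_box, object_scores, verb_scores):
    """Stack the fields of one human-object pair into a 2D grid.

    Hypothetical layout (an assumption, not from the paper): one row per
    field, zero-padded on the right to a common width.
    """
    rows = [list(human_box), list(object_box),
            list(object_scores), list(verb_scores)]
    width = max(len(row) for row in rows)
    return [row + [0.0] * (width - len(row)) for row in rows]


def forward_noise(image, alpha_bar_t, rng):
    """Standard DDPM forward step: x_t = sqrt(a) * x_0 + sqrt(1 - a) * eps.

    This is the generic formulation, swapped in for illustration; the
    paper's customized HOI diffusion process is not reproduced here.
    """
    signal = math.sqrt(alpha_bar_t)
    noise = math.sqrt(1.0 - alpha_bar_t)
    return [[signal * v + noise * rng.gauss(0.0, 1.0) for v in row]
            for row in image]


# Toy values chosen only for demonstration.
hoi_image = recast_as_image(
    human_box=(0.10, 0.20, 0.40, 0.90),
    object_box=(0.35, 0.50, 0.60, 0.80),
    object_scores=(0.05, 0.90, 0.05),
    verb_scores=(0.10, 0.70, 0.10, 0.05, 0.05),
)
noisy = forward_noise(hoi_image, alpha_bar_t=0.5, rng=random.Random(0))
```

In a diffusion-based detector of this kind, a learned denoiser would be trained to recover something like `hoi_image` from `noisy`, conditioned on features of the input picture; that model is omitted here because the source gives no details of its architecture beyond the name "slice patchification".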