CV · Mar 2, 2023

Weakly-supervised HOI Detection via Prior-guided Bi-level Representation Learning

Bo Wan, Yongfei Liu, Desen Zhou, Tinne Tuytelaars, Xuming He

arXiv:2303.01313v1 · 10.417 citations · h-index: 75

Originality: Incremental advance

AI Analysis

This paper addresses scalable HOI detection for human-centric scene understanding, but its contribution is incremental, as it builds on prior weak-supervision and CLIP-based methods.

The paper tackles weakly-supervised human-object interaction (HOI) detection, which is challenging because of ambiguous human-object associations and noisy training signals, and reports results that outperform previous work by a sizable margin on the HICO-DET and V-COCO benchmarks.

Human-object interaction (HOI) detection plays a crucial role in human-centric scene understanding and serves as a fundamental building block for many vision tasks. One generalizable and scalable strategy for HOI detection is to use weak supervision, learning from image-level annotations only. This is inherently challenging due to ambiguous human-object associations, the large search space of candidate HOIs, and a highly noisy training signal. A promising strategy to address these challenges is to exploit knowledge from large-scale pretrained models (e.g., CLIP), but a direct knowledge distillation strategy (Liao et al., 2022) does not perform well in the weakly-supervised setting. In contrast, we develop a CLIP-guided HOI representation capable of incorporating the prior knowledge at both the image level and the HOI instance level, and adopt a self-taught mechanism to prune incorrect human-object associations. Experimental results on HICO-DET and V-COCO show that our method outperforms previous works by a sizable margin, demonstrating the efficacy of our HOI representation.
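To make "learning from image-level annotations only" concrete, the following is a minimal sketch of a generic multiple-instance learning objective, not the paper's actual model: per-pair interaction scores are max-pooled into one score per class and compared with the image-level labels. The function names, the toy scores, and the choice of max pooling are all illustrative assumptions.

```python
import math

def image_level_scores(pair_scores):
    """Max-pool per-pair class probabilities into one score per class.

    pair_scores: list of candidate human-object pairs, each a list of
    per-class interaction probabilities in [0, 1].
    """
    num_classes = len(pair_scores[0])
    return [max(pair[c] for pair in pair_scores) for c in range(num_classes)]

def bce_loss(scores, labels, eps=1e-7):
    """Mean binary cross-entropy between pooled scores and image-level labels."""
    total = 0.0
    for s, y in zip(scores, labels):
        s = min(max(s, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(s) + (1 - y) * math.log(1.0 - s))
    return total / len(scores)

# Hypothetical toy input: 3 candidate human-object pairs, 2 interaction classes.
pair_scores = [
    [0.9, 0.1],
    [0.2, 0.3],
    [0.1, 0.2],
]
image_labels = [1, 0]  # only image-level supervision: class 0 present, class 1 absent

scores = image_level_scores(pair_scores)
loss = bce_loss(scores, image_labels)
```

Because only the highest-scoring pair per class receives the training signal, an objective of this kind cannot tell which pair truly performs the interaction, which illustrates the ambiguous-association problem the abstract describes.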
