CV · Aug 11, 2023

Compositional Learning in Transformer-Based Human-Object Interaction Detection

Zikun Zhuang, Ruihao Qian, Chi Xie, Shuang Liang

arXiv:2308.05961v1 · 3.95 citations · h-index: 24

Originality: Incremental advance

AI Analysis

This work addresses the challenge of few-shot and zero-shot learning in HOI detection for visual scene understanding, representing an incremental improvement over existing compositional methods.

The paper tackles the long-tailed distribution problem in human-object interaction detection by proposing a transformer-based compositional learning framework that re-composes human-object pair and interaction representations without auxiliary information, achieving state-of-the-art performance with significant gains on rare classes.

Human-object interaction (HOI) detection is an important part of understanding human activities and visual scenes. The long-tailed distribution of labeled instances is a primary challenge in HOI detection, motivating research in few-shot and zero-shot learning. Inspired by the combinatorial nature of HOI triplets, some existing approaches adopt compositional learning, in which object and action features are learned individually and re-composed into new training samples. However, these methods follow the CNN-based two-stage paradigm, which has limited feature extraction ability, and often rely on auxiliary information for better performance. Without introducing any additional information, we propose a transformer-based framework for compositional HOI learning. Human-object pair representations and interaction representations are re-composed across different HOI instances, which brings in richer contextual information and promotes the generalization of knowledge. Experiments show that our simple but effective method achieves state-of-the-art performance, especially on rare HOI classes.
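
The re-composition idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how the representations are combined, so the function name `recompose`, the dictionary keys, the use of concatenation, and the labeling rule (verb taken from the interaction representation, object taken from the pair representation) are all assumptions made only for illustration.

```python
from itertools import permutations


def recompose(instances):
    """Build composite samples by crossing instances (hypothetical sketch).

    Each instance is a dict with:
      'pair'        : list of floats, stand-in for a human-object pair representation
      'interaction' : list of floats, stand-in for an interaction representation
      'object'      : object class name
      'verb'        : action class name
    """
    composites = []
    # Every ordered pair (i, j) with i != j yields one new sample.
    for a, b in permutations(instances, 2):
        composites.append({
            # Assumption: combine by simple concatenation.
            "feature": a["pair"] + b["interaction"],
            # Assumption: the verb follows the interaction representation,
            # the object follows the pair representation.
            "label": (b["verb"], a["object"]),
        })
    return composites


instances = [
    {"pair": [1.0, 2.0], "interaction": [0.5], "object": "bicycle", "verb": "ride"},
    {"pair": [3.0, 4.0], "interaction": [0.9], "object": "horse", "verb": "feed"},
]
for sample in recompose(instances):
    print(sample["label"], sample["feature"])
```

A real system would presumably also discard implausible verb-object combinations before training on the composites; that filtering step is omitted here.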
