No-Frills Human-Object Interaction Detection: Factorization, Layout Encodings, and Training Techniques
This work addresses the detection of interactions between humans and objects in images, a task relevant to applications such as robotics and surveillance. The contribution is incremental in that it builds on existing pre-trained detectors and training techniques.
The paper tackles human-object interaction detection with a simple factorized model that uses appearance and layout encodings. The model outperforms more complex methods and achieves state-of-the-art results on the HICO-Det dataset.
We show that for human-object interaction detection, a relatively simple factorized model with appearance and layout encodings constructed from pre-trained object detectors outperforms more sophisticated approaches. Our model includes factors for detection scores, human and object appearance, coarse layout (box-pair configuration), and optionally fine-grained layout (human pose). We also develop training techniques that improve learning efficiency by: (1) eliminating a train-inference mismatch; (2) rejecting easy negatives during mini-batch training; and (3) using a ratio of negatives to positives that is two orders of magnitude larger than in existing approaches. We conduct a thorough ablation study on the challenging HICO-Det dataset to understand the importance of the different factors and training techniques.
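To make the factorization concrete, the following is a minimal sketch of how detection scores and per-factor scores might be combined for one human-object box pair. The abstract does not give the combination rule or the exact layout features, so everything here is an assumption for illustration: the function names `box_pair_features` and `interaction_score` are hypothetical, the layout features (normalized center offset, log size ratios, IoU) are one plausible coarse encoding, and summing factor logits before a sigmoid is one simple choice among several.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real-valued logit to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def box_pair_features(human_box, object_box):
    """Hypothetical coarse layout encoding for a box pair.

    Boxes are (x1, y1, x2, y2) with x2 > x1 and y2 > y1.
    Returns [dx, dy, log_w_ratio, log_h_ratio, iou].
    """
    hx1, hy1, hx2, hy2 = human_box
    ox1, oy1, ox2, oy2 = object_box
    hw, hh = hx2 - hx1, hy2 - hy1
    ow, oh = ox2 - ox1, oy2 - oy1
    # Offset between box centers, normalized by the human box size.
    dx = ((ox1 + ox2) - (hx1 + hx2)) / (2.0 * hw)
    dy = ((oy1 + oy2) - (hy1 + hy2)) / (2.0 * hh)
    # Relative scale of the object with respect to the human.
    log_w = math.log(ow / hw)
    log_h = math.log(oh / hh)
    # Intersection over union of the two boxes.
    iw = max(0.0, min(hx2, ox2) - max(hx1, ox1))
    ih = max(0.0, min(hy2, oy2) - max(hy1, oy1))
    inter = iw * ih
    union = hw * hh + ow * oh - inter
    return [dx, dy, log_w, log_h, inter / union]


def interaction_score(human_det, object_det, factor_logits):
    """Assumed combination: detector scores times sigmoid of summed factor logits.

    human_det, object_det: detection confidences in [0, 1].
    factor_logits: dict mapping a factor name (e.g. "appearance",
    "box_layout", "pose") to its logit for one interaction class.
    """
    return human_det * object_det * sigmoid(sum(factor_logits.values()))


features = box_pair_features((0, 0, 2, 2), (1, 1, 3, 3))
score = interaction_score(0.9, 0.8, {"appearance": 0.5, "box_layout": -0.5})
```

In a trained model the factor logits would come from small networks applied to appearance features and layout encodings such as the one returned by `box_pair_features`; here they are passed in directly to keep the sketch self-contained. An optional factor such as human pose is simply an extra entry in `factor_logits`.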