CV, Apr 15, 2024

HOI-Ref: Hand-Object Interaction Referral in Egocentric Vision

Siddhant Bansal, Michael Wray, Dima Damen

arXiv:2404.09933v114.112 citations, h-index: 43

Originality: Incremental advance

AI Analysis

This work addresses visual referral in egocentric vision for applications such as robotics or AR. The advance is incremental, since it adapts existing VLMs to a new dataset and task.

The paper tackles the problem of understanding hand-object interactions in egocentric images by proposing the HOI-Ref task and curating the HOI-QA dataset of 3.9M question-answer pairs. Fine-tuning VLMs on this dataset improves performance by 27.9% for referring hands and objects and by 26.7% for referring interactions.

Large Vision Language Models (VLMs) are now the de facto state-of-the-art for a number of tasks including visual question answering, recognising objects, and spatial referral. In this work, we propose the HOI-Ref task for egocentric images that aims to understand interactions between hands and objects using VLMs. To enable HOI-Ref, we curate the HOI-QA dataset that consists of 3.9M question-answer pairs for training and evaluating VLMs. HOI-QA includes questions relating to locating hands, objects, and critically their interactions (e.g. referring to the object being manipulated by the hand). We train the first VLM for HOI-Ref on this dataset and call it VLM4HOI. Our results demonstrate that VLMs trained for referral on third person images fail to recognise and refer hands and objects in egocentric images. When fine-tuned on our egocentric HOI-QA dataset, performance improves by 27.9% for referring hands and objects, and by 26.7% for referring interactions.
