Egocentric Hand-object Interaction Detection and Application
This work addresses the problem of efficient real-time interaction detection for applications such as activity segmentation, though the contribution is incremental, with competitive rather than groundbreaking gains.
The paper tackles hand-object interaction detection from an egocentric perspective. It achieves 89% accuracy, comparable to the 92% of prior work, while running in real time at over 30 FPS versus 1-2 FPS. It also applies the detector to segment script-less activities, with F1 scores of 68.2% and 82.8% on benchmark datasets.
In this paper, we present a method to detect hand-object interaction from an egocentric perspective. In contrast to large-scale, data-driven, discriminator-based methods such as \cite{Shan20}, we propose a novel workflow that utilises the cues of hand and object. Specifically, we train networks that predict hand pose, hand mask and in-hand object mask, and use them to jointly predict the hand-object interaction status. We compare our method with the most recent work from Shan et al. \cite{Shan20} on selected images from the EPIC-KITCHENS \cite{damen2018scaling} dataset and achieve $89\%$ accuracy on HOI (hand-object interaction) detection, which is comparable to Shan's ($92\%$). In terms of real-time performance, however, on the same machine our method runs at over $\textbf{30}$ FPS, which is much more efficient than Shan's ($\textbf{1}\sim\textbf{2}$ FPS). Furthermore, with our approach we are able to segment script-less activities by using the detected HOI status to extract the relevant frames. We achieve $\textbf{68.2\%}$ and $\textbf{82.8\%}$ F1 score on the GTEA \cite{fathi2011learning} and UTGrasp \cite{cai2015scalable} datasets respectively, both of which are comparable to the SOTA methods.
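As a rough illustration of how a per-frame HOI status could drive activity segmentation, the sketch below groups consecutive interaction frames into segments. It is not the method of the paper: the function name `segment_by_hoi`, the `min_len` filter, and the assumption that segments are simply runs of positive frames are all hypothetical choices made for this example.

```python
from typing import List, Tuple


def segment_by_hoi(status: List[bool], min_len: int = 1) -> List[Tuple[int, int]]:
    """Group consecutive interaction frames into (start, end) index pairs.

    `status[i]` is True when frame i is predicted to contain a hand-object
    interaction. Both indices are inclusive. Runs shorter than `min_len`
    frames are dropped (an assumed noise filter, not taken from the paper).
    """
    segments: List[Tuple[int, int]] = []
    start = None  # index where the current run of interaction frames began
    for i, flag in enumerate(status):
        if flag and start is None:
            start = i  # a new run begins
        elif not flag and start is not None:
            if i - start >= min_len:
                segments.append((start, i - 1))
            start = None  # the run has ended
    # Close a run that extends to the last frame.
    if start is not None and len(status) - start >= min_len:
        segments.append((start, len(status) - 1))
    return segments


flags = [False, True, True, True, False, True, False, True, True]
print(segment_by_hoi(flags, min_len=2))  # [(1, 3), (7, 8)]
```

Here the single interaction frame at index 5 is discarded because it is shorter than `min_len`, which shows how a simple length threshold might suppress isolated detections.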