Skeleton-based Action Recognition of People Handling Objects
This work addresses a specific need in visual surveillance: more accurate action recognition when objects are involved. It represents an incremental advance in skeleton-based methods.
The paper tackles the problem of recognizing human actions that involve objects, such as handling a phone or a cup. It proposes a framework that applies graph convolutional networks to skeletal graphs built from human and object poses, and it reports outperforming the state-of-the-art skeleton-based method on an open benchmark and the authors' own data sets.
In visual surveillance systems, it is necessary to recognize the behavior of people handling objects such as a phone, a cup, or a plastic bag. To address this problem, we propose a new framework that recognizes object-related human actions with graph convolutional networks using human and object poses. In this framework, we construct skeletal graphs of reliable human poses by selectively sampling the informative frames in a video, that is, the frames whose human joints receive high confidence scores in pose estimation. The skeletal graphs generated from the sampled frames represent human poses in relation to the object position in both the spatial and temporal domains, and these graphs are used as inputs to the graph convolutional networks. Experiments on an open benchmark and our own data sets verify the validity of our framework: our method outperforms the state-of-the-art method for skeleton-based action recognition.
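The frame sampling and graph construction described above can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the 0.5 confidence threshold, the rule of scoring a frame by the fraction of confidently detected joints, and the choice to model the object as a single extra node linked to selected joints are all assumptions made here for illustration.

```python
import numpy as np


def sample_reliable_frames(confidences, num_samples, min_conf=0.5):
    """Return indices of the frames with the most confidently detected joints.

    confidences: array of shape (T, J), one pose-estimation score per joint per frame.
    min_conf and the scoring rule are assumptions made for this sketch.
    """
    confidences = np.asarray(confidences, dtype=float)
    # Score each frame by the fraction of joints at or above the threshold.
    frame_scores = (confidences >= min_conf).mean(axis=1)
    # Stable sort so that tied frames keep their original (earlier-first) order.
    order = np.argsort(-frame_scores, kind="stable")
    # Restore chronological order for the temporal dimension of the graph.
    return np.sort(order[:num_samples])


def build_adjacency(num_joints, bones, object_links):
    """Build a symmetric adjacency matrix with one extra node for the object.

    bones: pairs of joint indices connected in the skeleton.
    object_links: joint indices (for example the hands) linked to the object node.
    """
    n = num_joints + 1   # the object node gets index num_joints (assumption)
    adj = np.eye(n)      # self-loops on every node
    for i, j in bones:
        adj[i, j] = adj[j, i] = 1.0
    for j in object_links:
        adj[j, num_joints] = adj[num_joints, j] = 1.0
    return adj


# Toy example: 4 frames, 3 joints, object linked to joint 2.
conf = np.array([
    [0.9, 0.8, 0.9],
    [0.2, 0.3, 0.9],
    [0.9, 0.9, 0.9],
    [0.1, 0.1, 0.1],
])
frames = sample_reliable_frames(conf, num_samples=2)
adj = build_adjacency(num_joints=3, bones=[(0, 1), (1, 2)], object_links=[2])
print(frames)     # indices of the two most reliable frames
print(adj.shape)  # (num_joints + 1, num_joints + 1)
```

In this toy input, frames 0 and 2 have every joint above the threshold, so they are the two frames kept. The resulting adjacency matrix would then serve as the spatial structure for each sampled frame, with temporal edges linking the same node across consecutive sampled frames.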