CV · Dec 4, 2017

Robust 3D Action Recognition through Sampling Local Appearances and Global Distributions

arXiv:1712.01090v2 · 40 citations

Originality: Incremental advance

AI Analysis

This work addresses a domain-specific problem in 3D action recognition for applications like human-computer interaction and surveillance, with incremental improvements over existing methods.

The paper tackles the challenge of distinguishing similar actions in 3D action recognition. It proposes a two-layer Bag-of-Visual-Words model that jointly encodes motion and shape cues and is reported to be robust to background clutter, partial occlusions and pepper noise.

3D action recognition has broad applications in human-computer interaction and intelligent surveillance. However, recognizing similar actions remains challenging since previous literature fails to capture motion and shape cues effectively from noisy depth data. In this paper, we propose a novel two-layer Bag-of-Visual-Words (BoVW) model, which suppresses the noise disturbances and jointly encodes both motion and shape cues. First, background clutter is removed by a background modeling method that is designed for depth data. Then, motion and shape cues are jointly used to generate robust and distinctive spatial-temporal interest points (STIPs): motion-based STIPs and shape-based STIPs. In the first layer of our model, a multi-scale 3D local steering kernel (M3DLSK) descriptor is proposed to describe local appearances of cuboids around motion-based STIPs. In the second layer, a spatial-temporal vector (STV) descriptor is proposed to describe the spatial-temporal distributions of shape-based STIPs. Using the Bag-of-Visual-Words (BoVW) model, motion and shape cues are combined to form a fused action representation. Our model performs favorably compared with common STIP detection and description methods. Thorough experiments verify that our model is effective in distinguishing similar actions and robust to background clutter, partial occlusions and pepper noise.
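The fusion step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the M3DLSK and STV descriptors are not computed here (random arrays stand in for them), and the use of plain k-means codebooks, hard assignment, L1-normalized histograms and simple concatenation as the fusion rule are all assumptions made for illustration. The function names are hypothetical.

```python
import numpy as np


def kmeans(data, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns a (k, d) codebook of visual words."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialize centers from k distinct data points.
    centers = data[rng.choice(len(data), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared Euclidean distance from every point to every center: (n, k).
        dists = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = data[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return centers


def bovw_histogram(descriptors, codebook):
    """Hard-assign descriptors to their nearest word; return an L1-normalized histogram."""
    k = len(codebook)
    if len(descriptors) == 0:
        return np.zeros(k)
    dists = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    words = dists.argmin(axis=1)
    hist = np.bincount(words, minlength=k).astype(float)
    return hist / hist.sum()


def fused_representation(motion_desc, shape_desc, motion_codebook, shape_codebook):
    """Concatenate the two layers' histograms (fusion by concatenation is an assumption)."""
    return np.concatenate([
        bovw_histogram(motion_desc, motion_codebook),
        bovw_histogram(shape_desc, shape_codebook),
    ])


# Placeholder data: stand-ins for per-video motion-layer and shape-layer descriptors.
rng = np.random.default_rng(1)
motion_desc = rng.normal(size=(50, 8))  # e.g. 50 local appearance descriptors
shape_desc = rng.normal(size=(40, 3))   # e.g. 40 spatial-temporal vectors
motion_codebook = kmeans(motion_desc, k=5)
shape_codebook = kmeans(shape_desc, k=4)
video_vector = fused_representation(motion_desc, shape_desc, motion_codebook, shape_codebook)
```

In practice the codebooks would be learned from descriptors pooled over the training videos rather than from a single video, and the fused vector would be passed to a classifier; both choices are outside what the abstract specifies.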
