CV · HC · Sep 12, 2019

Efficient 2.5D Hand Pose Estimation via Auxiliary Multi-Task Training for Embedded Devices

arXiv:1909.05897v1
Originality: Incremental advance
AI Analysis

The work enables real-time hand tracking on resource-constrained devices, but the advance is incremental because it builds on existing 2D key-point estimation methods.

The paper tackles efficient 2.5D hand pose estimation for embedded devices such as AR/VR wearables. The resulting model has a memory footprint of less than 300 KB, runs at more than 50 Hz, and needs fewer than 35 MFLOPs, while achieving performance comparable to MobileNetV2.
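To put these figures in perspective, the short calculation below turns them into rough upper bounds. It is a back-of-the-envelope sketch, not something stated in the paper: it assumes the footprint is dominated by 32-bit weights, that 1 KB means 1024 bytes, and that the 35 MFLOPs figure is per forward pass. All constant names are illustrative.

```python
# Rough budget implied by the reported figures (assumptions noted inline).
BYTES_PER_PARAM = 4               # 32-bit weights, as reported for the trained model
MODEL_BYTES_LIMIT = 300 * 1024    # "less than 300 Kilobytes"; assumes 1 KB = 1024 bytes
FLOPS_PER_FRAME_LIMIT = 35e6      # "less than 35 MFLOPs"; assumed to be per forward pass
FRAME_RATE_HZ = 50                # lower bound on the reported frame rate

# Upper bound on parameter count if weights dominate the footprint.
max_params = MODEL_BYTES_LIMIT // BYTES_PER_PARAM

# Upper bound on sustained compute when running at exactly 50 Hz.
flops_per_second_at_50hz = FLOPS_PER_FRAME_LIMIT * FRAME_RATE_HZ

print(f"parameter budget: fewer than {max_params:,} parameters")
print(f"compute at 50 Hz: under {flops_per_second_at_50hz / 1e9:.2f} GFLOP/s")
```

Under these assumptions the network holds fewer than roughly 77 thousand parameters and needs under 1.75 GFLOP/s at 50 Hz.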

2D key-point estimation is an important precursor to 3D pose estimation problems for the human body and hands. In this work, we discuss the data, architecture, and training procedure necessary to deploy extremely efficient 2.5D hand pose estimation on embedded devices with a highly constrained memory and compute envelope, such as AR/VR wearables. Our 2.5D hand pose estimation consists of 2D key-point estimation of joint positions on an egocentric image captured by a depth sensor, followed by lifting to 2.5D using the corresponding depth values. Our contributions are twofold: (a) We discuss data labeling and augmentation strategies and the modules in the network architecture that collectively lead to $3\%$ of the FLOP count and $2\%$ of the number of parameters of the state-of-the-art MobileNetV2 architecture. (b) We propose an auxiliary multi-task training strategy needed to compensate for the small capacity of the network while achieving performance comparable to MobileNetV2. Our 32-bit trained model has a memory footprint of less than 300 Kilobytes and operates at more than 50 Hz with fewer than 35 MFLOPs.
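The lifting step described in the abstract, reading the depth value at each predicted 2D key-point, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `lift_to_2p5d`, the nearest-pixel lookup, and the clamping of out-of-image key-points to the border are all assumptions made for the example.

```python
import numpy as np

def lift_to_2p5d(keypoints_xy, depth_map):
    """Attach a depth value to each 2D key-point (hypothetical helper).

    keypoints_xy: (K, 2) array of (x, y) pixel coordinates.
    depth_map:    (H, W) depth image aligned with those coordinates.
    Returns a (K, 3) array of (x, y, depth).
    """
    keypoints_xy = np.asarray(keypoints_xy, dtype=np.float32)
    height, width = depth_map.shape

    # Nearest-pixel lookup; key-points outside the image are clamped
    # to the border (an assumption, not taken from the paper).
    cols = np.clip(np.rint(keypoints_xy[:, 0]).astype(int), 0, width - 1)
    rows = np.clip(np.rint(keypoints_xy[:, 1]).astype(int), 0, height - 1)

    depth = depth_map[rows, cols].astype(np.float32)
    return np.column_stack([keypoints_xy, depth])

# Usage with a toy 3x4 depth image and two key-points.
toy_depth = np.arange(12, dtype=np.float32).reshape(3, 4)
toy_keypoints = [[1.2, 0.9], [3.0, 2.0]]
lifted = lift_to_2p5d(toy_keypoints, toy_depth)
```

A real pipeline would also need to handle invalid or missing depth readings, which this sketch ignores.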
