Cross-modal Learning of Graph Representations using Radar Point Cloud for Long-Range Gesture Recognition
This work addresses the limited range of gesture recognition for human-computer interaction and offers a solution that works in challenging conditions such as low illumination. It is incremental, however, since it builds on existing graph and temporal modeling methods.
The paper tackles long-range gesture recognition by proposing a cross-modal learning architecture that transfers knowledge from camera to radar point clouds, achieving 98.4% accuracy for five gestures at distances of 1-2 meters.
Gesture recognition is one of the most intuitive modes of interaction and has attracted particular attention in human-computer interaction. Radar sensors have several intrinsic properties that make them well suited to gesture recognition: they work in low illumination and harsh weather conditions, and they are low-cost and compact. However, most existing work focuses on solutions with a range of less than one meter. We propose a novel architecture for long-range (1 m to 2 m) gesture recognition that leverages a point cloud-based cross-learning approach from camera point cloud to 60-GHz FMCW radar point cloud, which allows learning better representations while suppressing noise. We use a variant of Dynamic Graph CNN (DGCNN) for the cross-learning, which lets us model relationships between the points at both a local and a global level, and we employ a Bi-LSTM network to model the temporal dynamics. In the experimental results section, we demonstrate our model's overall accuracy of 98.4% for five gestures and its generalization capability.
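
To make the graph-based part of the abstract concrete, the following is a minimal NumPy sketch of one EdgeConv-style layer, the building block that DGCNN uses to relate each point to its nearest neighbours. It is an illustration under stated assumptions, not the paper's variant: the function names, the frame size of 32 points, the choice of k = 8, the random weights, and the exclusion of each point from its own neighbour list are all hypothetical choices made for the example.

```python
import numpy as np

def knn_indices(points, k):
    """Return indices of the k nearest neighbours of each point (self excluded)."""
    # Pairwise squared Euclidean distances, shape (N, N).
    diff = points[:, None, :] - points[None, :, :]
    dist2 = np.sum(diff ** 2, axis=-1)
    # Assumption for this sketch: a point is not its own neighbour.
    np.fill_diagonal(dist2, np.inf)
    # Sort each row by distance and keep the k closest.
    return np.argsort(dist2, axis=1)[:, :k]

def edge_features(points, k):
    """Build edge features [x_i, x_j - x_i] for every point i and neighbour j."""
    idx = knn_indices(points, k)                        # (N, k)
    neighbours = points[idx]                            # (N, k, D)
    centre = np.repeat(points[:, None, :], k, axis=1)   # (N, k, D)
    return np.concatenate([centre, neighbours - centre], axis=-1)  # (N, k, 2D)

def edge_conv(points, weight, k):
    """Shared linear map + ReLU on each edge, then max over the k neighbours."""
    feats = edge_features(points, k)            # (N, k, 2D)
    hidden = np.maximum(feats @ weight, 0.0)    # (N, k, C)
    return hidden.max(axis=1)                   # (N, C)

rng = np.random.default_rng(0)
cloud = rng.normal(size=(32, 3))   # hypothetical frame: 32 radar points (x, y, z)
W = rng.normal(size=(6, 16))       # hypothetical weights: 2*3 inputs -> 16 channels
out = edge_conv(cloud, W, k=8)     # per-point features, shape (32, 16)
```

In DGCNN the neighbour graph is recomputed from the features of each layer rather than fixed once from the input coordinates, which is what makes the graph "dynamic". Stacking several such layers and pooling over points would give one feature vector per frame, and a sequence of those vectors is the kind of input a Bi-LSTM could consume to model temporal dynamics.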