CV · AI · Jun 3, 2024

Enhancing Inertial Hand based HAR through Joint Representation of Language, Pose and Synthetic IMUs

arXiv:2406.01316v2 · 13 citations
AI Analysis

The paper addresses data scarcity in wearable HAR, particularly for subtle motions, but the contribution is incremental because it builds on prior video-to-IMU synthesis methods.

The paper tackles the scarcity of labeled sensor data for Human Activity Recognition (HAR) by proposing Multi$^3$Net, a framework that learns joint representations of text, pose, and synthetic IMU data derived from videos. The authors report improved recognition of fine-grained activities, with models trained on the resulting synthetic IMU data surpassing existing approaches.

Due to the scarcity of labeled sensor data in HAR, prior research has turned to video data to synthesize Inertial Measurement Unit (IMU) data, capitalizing on its rich activity annotations. However, generating IMU data from videos presents challenges for HAR in real-world settings, attributed to the poor quality of synthetic IMU data and its limited efficacy in subtle, fine-grained motions. In this paper, we propose Multi$^3$Net, our novel multi-modal, multitask, and contrastive-based framework to address the issue of limited data. Our pretraining procedure uses videos from online repositories, aiming to learn joint representations of text, pose, and IMU simultaneously. By employing video data and contrastive learning, our method seeks to enhance wearable HAR performance, especially in recognizing subtle activities. Our experimental findings validate the effectiveness of our approach in improving HAR performance with IMU data. We demonstrate that models trained with synthetic IMU data generated from videos using our method surpass existing approaches in recognizing fine-grained activities.
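
The abstract does not spell out the contrastive objective, so the following is only a minimal sketch of how joint representations of text, pose, and IMU could be trained. It assumes a symmetric InfoNCE loss summed over the three modality pairs, which is a common choice and not necessarily what Multi$^3$Net uses. All function names, the temperature value, and the toy embeddings are hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss: anchors[i] should match positives[i], not positives[j]."""
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        # Cosine similarity of anchor i to every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(a)

def symmetric_info_nce(x, y, temperature=0.1):
    """Average the loss in both directions (x to y and y to x)."""
    return 0.5 * (info_nce(x, y, temperature) + info_nce(y, x, temperature))

def joint_contrastive_loss(text, pose, imu, temperature=0.1):
    """Assumed combination: sum of pairwise losses over the three modalities."""
    return (symmetric_info_nce(text, pose, temperature)
            + symmetric_info_nce(text, imu, temperature)
            + symmetric_info_nce(pose, imu, temperature))

# Toy batch of two clips; row i of each list describes the same clip.
text = [[1.0, 0.0], [0.0, 1.0]]
pose = [[0.9, 0.1], [0.1, 0.9]]
imu = [[0.8, 0.2], [0.2, 0.8]]

aligned = joint_contrastive_loss(text, pose, imu)
shuffled = joint_contrastive_loss(text, pose[::-1], imu)  # pose rows mismatched
print(aligned < shuffled)
```

In this toy batch the loss is lower when the rows of the three modalities describe the same clip than when the pose rows are swapped, which is the property a contrastive pretraining stage relies on. In practice the embeddings would come from learned encoders for each modality rather than fixed lists.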

