CV | Jan 31, 2025

XRF V2: A Dataset for Action Summarization with Wi-Fi Signals, and IMUs in Phones, Watches, Earbuds, and Glasses

Bo Lan, Pei Li, Jiaxi Yin, Yunpeng Song, Ge Wang, Han Ding, Jinsong Han, Fei Wang

arXiv:2501.19034v216.415 citations | h-index: 43 | Has Code | Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies

Originality: Incremental advance

AI Analysis

The paper addresses the emerging problem of summarizing continuous human actions for applications such as health monitoring and smart homes. Its contribution is incremental, consisting of a new dataset and a new model.

The paper tackles action summarization using Wi-Fi and IMU signals in smart-home environments by introducing the XRF V2 dataset and XRFMamba neural network, achieving an average mAP of 78.74 for temporal action localization and an average Response Meaning Consistency of 0.802 for action summarization.

Human Action Recognition (HAR) plays a crucial role in applications such as health monitoring, smart home automation, and human-computer interaction. While HAR has been extensively studied, action summarization using Wi-Fi and IMU signals in smart-home environments, which involves identifying and summarizing continuous actions, remains an emerging task. This paper introduces the novel XRF V2 dataset, designed for indoor daily activity Temporal Action Localization (TAL) and action summarization. XRF V2 integrates multimodal data from Wi-Fi signals, IMU sensors (smartphones, smartwatches, headphones, and smart glasses), and synchronized video recordings, offering a diverse collection of indoor activities from 16 volunteers across three distinct environments. To tackle TAL and action summarization, we propose the XRFMamba neural network, which excels at capturing long-term dependencies in untrimmed sensory sequences and achieves the best performance with an average mAP of 78.74, outperforming the recent WiFiTAD by 5.49 points in mAP@avg while using 35% fewer parameters. For action summarization, we introduce a new metric, Response Meaning Consistency (RMC), on which our method achieves an average score (mRMC) of 0.802. We envision XRF V2 as a valuable resource for advancing research in human action localization, action forecasting, pose estimation, multimodal foundation model pre-training, synthetic data generation, and more. The data and code are available at https://github.com/aiotgroup/XRFV2.
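For readers unfamiliar with how TAL results such as mAP@avg are usually scored, the following is a minimal sketch, not the authors' evaluation code. It assumes the common convention that a predicted segment is matched to a labelled segment by temporal intersection-over-union (tIoU), and that mAP@avg is the mean of mAP values computed at several tIoU thresholds. The function names, the thresholds, and all numbers below are hypothetical and are not taken from the paper.

```python
def temporal_iou(pred, truth):
    """Temporal IoU of two (start, end) intervals given in seconds."""
    inter = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - inter
    return inter / union if union > 0 else 0.0


def mean_over_thresholds(map_by_threshold):
    """Average per-threshold mAP values into a single mAP@avg figure."""
    return sum(map_by_threshold.values()) / len(map_by_threshold)


# Hypothetical segments: a predicted action at 2-6 s and a labelled one at 4-8 s.
# They share 2 s out of the 6 s they jointly cover.
overlap = temporal_iou((2.0, 6.0), (4.0, 8.0))

# Hypothetical per-threshold mAP values (assumed thresholds, not paper results).
example_map = {0.5: 80.0, 0.6: 70.0, 0.7: 60.0}
average_map = mean_over_thresholds(example_map)
```

Under this convention a stricter tIoU threshold accepts fewer predictions as correct, so averaging across thresholds rewards models that localize action boundaries precisely rather than only roughly.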
