MoCA: Multi-modal Cross-masked Autoencoder for Digital Health Measurements
This work offers a way to leverage unlabeled multi-modal wearable data in digital health, though the contribution appears incremental, since it extends existing masked autoencoder methods.
The paper tackles the challenge of analyzing unlabeled and incomplete multi-modal wearable sensor data. It proposes MoCA, a self-supervised learning framework built on a cross-modality masking scheme, and reports strong improvements in reconstruction and classification tasks on benchmark datasets.
Wearable devices enable continuous multi-modal physiological and behavioral monitoring, yet analysis of these data streams faces fundamental challenges, including the lack of gold-standard labels and incomplete sensor data. While self-supervised learning approaches have shown promise for addressing these issues, existing multi-modal extensions leave room to better exploit the rich temporal and cross-modal correlations inherent in simultaneously recorded wearable sensor data. We propose the Multi-modal Cross-masked Autoencoder (MoCA), a self-supervised learning framework that combines a transformer architecture with masked autoencoder (MAE) methodology, using a principled cross-modality masking scheme that explicitly leverages correlation structures between sensor modalities. MoCA demonstrates strong performance gains across reconstruction and downstream classification tasks on diverse benchmark datasets. We further provide theoretical guarantees by establishing a fundamental connection between the multi-modal MAE loss and kernelized canonical correlation analysis through a Reproducing Kernel Hilbert Space framework, which offers principled guidance for designing correlation-aware masking strategies. Our approach offers a novel solution for leveraging unlabeled multi-modal wearable data while handling missing modalities, with broad applications across digital health domains.
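The abstract does not spell out the masking rule, so the following is only an illustrative sketch of what a cross-modality mask could look like, not the authors' scheme. It assumes the input has been split into a grid of (modality, time patch) tokens and that, at every time patch, a fixed number of modalities is hidden while at least one stays visible, so that reconstruction has to draw on the other sensors. The function name `cross_modal_mask` and its parameters are hypothetical.

```python
import numpy as np


def cross_modal_mask(n_modalities, n_patches, mask_ratio=0.5, rng=None):
    """Build a boolean mask of shape (n_modalities, n_patches); True = masked.

    Hypothetical rule (an assumption, not taken from the paper): at each
    time patch, hide a fixed number of modalities chosen at random, while
    always keeping at least one modality visible and at least one hidden.
    """
    if n_modalities < 2:
        raise ValueError("cross-modal masking needs at least two modalities")

    rng = np.random.default_rng(rng)

    # Number of modalities hidden per time patch, clamped to [1, M - 1].
    n_masked = int(round(mask_ratio * n_modalities))
    n_masked = min(max(n_masked, 1), n_modalities - 1)

    mask = np.zeros((n_modalities, n_patches), dtype=bool)
    for t in range(n_patches):
        # Choose which modalities to hide at this time patch.
        hidden = rng.choice(n_modalities, size=n_masked, replace=False)
        mask[hidden, t] = True
    return mask
```

Calling `cross_modal_mask(4, 10, mask_ratio=0.5, rng=0)` would yield a 4 x 10 mask with exactly two hidden modalities per time patch. A correlation-aware variant, as the abstract suggests, would presumably weight the choice of hidden modalities by their estimated cross-modal correlation rather than sampling uniformly.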