LGAS
Aug 29, 2025

Speech Foundation Models Generalize to Time Series Tasks from Wearable Sensor Data

arXiv:2509.00221v2
h-index: 3
Originality: Incremental advance
AI Analysis

This work addresses data-scarce time-series problems for wearable sensor applications by leveraging pre-trained speech models, representing an incremental but practical advance in cross-modal generalization.

The authors tackled the problem of data scarcity in wearable sensor time-series tasks by showing that speech foundation models (HuBERT and wav2vec 2.0) can generalize to these domains, achieving state-of-the-art performance on mood classification, arrhythmia detection, and activity classification tasks.

Speech and sensor time-series data both encode information in the time and frequency domains, such as spectral powers and waveform shapelets. We show that speech foundation models learn representations that generalize beyond the speech domain and achieve state-of-the-art performance on diverse time-series tasks from wearable sensors. Probes trained on features extracted from HuBERT and wav2vec 2.0 outperform probes trained on features from self-supervised models trained directly on modality-specific datasets for mood classification, arrhythmia detection, and activity classification tasks. We find that the convolutional feature encoders of speech models are particularly relevant for wearable sensor applications. The proposed approach enhances performance on data-scarce time-series tasks using simple probing methods. This work takes a step toward developing generalized time-series models that unify speech and sensor modalities.
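The probing setup described in the abstract (a frozen encoder, pooled features, and a simple classifier on top) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the encoder is a hypothetical fixed cosine/sine filter bank standing in for a pretrained convolutional feature encoder (it does not use HuBERT or wav2vec 2.0 weights), the signals are synthetic sinusoids rather than wearable sensor data, and the names `encode`, `mean_pool`, and `train_probe` are not from the paper.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

WIDTH, STRIDE = 20, 4   # filter length and hop size (illustrative values)
FREQS = (0.05, 0.20)    # cycles per sample for the two synthetic classes


def make_kernels():
    """Fixed cosine/sine filters: a stand-in for a frozen, pretrained conv encoder."""
    kernels = []
    for f in FREQS:
        for fn in (math.cos, math.sin):
            kernels.append([fn(2 * math.pi * f * t) / WIDTH for t in range(WIDTH)])
    return kernels


def encode(signal, kernels):
    """Strided 1-D convolution followed by ReLU; one feature vector per frame."""
    frames = []
    for start in range(0, len(signal) - WIDTH + 1, STRIDE):
        window = signal[start:start + WIDTH]
        frames.append([max(0.0, sum(k * x for k, x in zip(kernel, window)))
                       for kernel in kernels])
    return frames


def mean_pool(frames):
    """Average frame features over time to get one vector per signal."""
    return [sum(col) / len(frames) for col in zip(*frames)]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(features, labels, epochs=200, lr=0.5):
    """Linear probe: logistic regression trained by SGD; the encoder is never updated."""
    weights, bias = [0.0] * len(features[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            error = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            weights = [w - lr * error * v for w, v in zip(weights, x)]
            bias -= lr * error
    return weights, bias


def predict(weights, bias, x):
    return int(sum(w * v for w, v in zip(weights, x)) + bias > 0)


def synthetic_signal(label, length=200):
    """Noisy sinusoid whose frequency depends on the class label (synthetic data)."""
    phase = rng.uniform(0, 2 * math.pi)
    return [math.sin(2 * math.pi * FREQS[label] * t + phase) + rng.gauss(0, 0.1)
            for t in range(length)]


kernels = make_kernels()
labels = [i % 2 for i in range(60)]
features = [mean_pool(encode(synthetic_signal(y), kernels)) for y in labels]

# Small training set to mimic a data-scarce task; the rest is held out.
train_x, train_y = features[:40], labels[:40]
test_x, test_y = features[40:], labels[40:]

weights, bias = train_probe(train_x, train_y)
accuracy = sum(predict(weights, bias, x) == y for x, y in zip(test_x, test_y)) / len(test_y)
print(f"probe accuracy on held-out synthetic signals: {accuracy:.2f}")
```

In a real replication, `encode` would presumably be replaced by the frozen feature encoder of a pretrained speech model applied to resampled sensor signals, with only the probe trained on the downstream labels; that step is omitted here because the source does not give its details.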
