AS | LG | SD | Jan 29, 2025

Privacy-Preserving Edge Speech Understanding with Tiny Foundation Models

arXiv:2502.01649v1 | 5 citations | h-index: 3 | SEC
Originality: Incremental advance
AI Analysis

This work addresses privacy concerns for users of cloud-based speech recognition by filtering sensitive content on-device, though it is an incremental improvement over prior privacy-preserving frameworks.

The paper tackles the problem of preserving privacy in speech recognition by filtering sensitive entities on resource-constrained devices without compromising transcription accuracy. It reports state-of-the-art transcription performance with less than 100 MB of memory, filtering about 83% of private entities on-device while achieving a relative word error rate reduction of 38.8-77.5% compared to existing offline transcription services.

Robust speech recognition systems rely on cloud service providers for inference. Such a system must ensure that an untrustworthy provider cannot deduce the sensitive content in speech. Speech content can be sanitized, but sanitization must not compromise transcription accuracy. Recognizing the underutilized capabilities of tiny speech foundation models (FMs), we propose, for the first time, a novel use for them: enhancing speech privacy on resource-constrained devices. We introduce XYZ, an edge/cloud privacy-preserving speech inference engine that can filter sensitive entities without compromising transcript accuracy. We use a timestamp-based on-device masking approach that relies on a token-to-entity prediction model to filter sensitive entities. Our choice of mask strategically conceals parts of the input and hides sensitive data. The masked input is sent to a trusted cloud service or to a local hub to generate the masked output. The effectiveness of XYZ hinges on how well the entity time segments are masked. Our recovery is a confidence-score-based approach that chooses the best prediction between the cloud and on-device models. We implement XYZ on a 64-bit Raspberry Pi 4B. Experiments show that our solution leads to robust speech recognition without forsaking privacy. With less than 100 MB of memory, XYZ achieves state-of-the-art (SOTA) speech transcription performance while filtering about 83% of private entities directly on-device. XYZ is 16x smaller in memory and 17x more compute-efficient than prior privacy-preserving speech frameworks, and it achieves a relative reduction in word error rate (WER) of 38.8-77.5% compared to existing offline transcription services.
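
The masking and recovery steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Segment` type, the `mask_audio` and `recover` functions, the use of silence (zeros) as the mask, and the one-to-one alignment of on-device and cloud segments are all assumptions made for illustration. The abstract does not say which mask is used or how the two transcripts are aligned.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """Hypothetical time-stamped transcript unit (not from the paper)."""
    start: float        # seconds
    end: float          # seconds
    text: str
    confidence: float   # model confidence in [0, 1]
    sensitive: bool = False  # set by an on-device entity predictor


def mask_audio(samples: List[float], sample_rate: int,
               segments: List[Segment]) -> List[float]:
    """Return a copy of the audio with sensitive time spans silenced.

    Zeroing is an assumed mask; the paper's actual mask may differ.
    """
    masked = list(samples)
    for seg in segments:
        if not seg.sensitive:
            continue
        lo = max(0, round(seg.start * sample_rate))
        hi = min(len(masked), round(seg.end * sample_rate))
        if hi > lo:
            masked[lo:hi] = [0.0] * (hi - lo)
    return masked


def recover(device: List[Segment], cloud: List[Segment]) -> str:
    """Merge transcripts segment by segment.

    Assumes both lists are aligned one-to-one. Masked spans always use
    the on-device text, since the cloud never heard them; elsewhere the
    higher-confidence prediction wins (ties go to the on-device model).
    """
    words = []
    for dev, cld in zip(device, cloud):
        if dev.sensitive or dev.confidence >= cld.confidence:
            words.append(dev.text)
        else:
            words.append(cld.text)
    return " ".join(words)
```

In this sketch, only the output of `mask_audio` would leave the device, and `recover` would run locally once the cloud transcript comes back, so the sensitive span is filled in from the on-device prediction alone.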
