CL · May 27

When Helpful Context Leaks: Privacy Risks in Domain-Adapted ASR

arXiv:2605.28211
AI Analysis

This work highlights an overlooked privacy vulnerability in customised speech models, one that matters to professionals who rely on domain adaptation, and it provides a controlled evaluation and mitigation analysis.

The paper identifies a privacy risk in domain-adapted SpeechLLMs: instead of transcribing the spoken word, a model can output a phonetically similar word from its context or training data, leaking sensitive information. Experiments show measurable leakage for both prompting and fine-tuning, and the risk compounds when the two are combined. Fine-tuning without context prompts offers the best accuracy-leakage trade-off.

SpeechLLMs are increasingly deployed in professional settings where domain customisation is standard practice: users supply context in prompts with sensitive information, fine-tune on proprietary recordings, or both. We identify and systematically investigate an overlooked privacy risk of such customisation: a model adapted to recognise domain-specific terminology can be nudged into transcribing a phonetically similar word from its context or training data, even when a different word is spoken, thereby leaking private information. To evaluate this risk, we construct a controlled dataset and measure leakage rates across two customisation mechanisms, prompting and fine-tuning. Both mechanisms cause measurable leakage, compounding when combined. We evaluate a prompt-level mitigation strategy and analyse the accuracy-leakage trade-off across customisation approaches, finding that fine-tuning without context prompts offers the best balance. We release our code and dataset publicly.
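
As a rough illustration of what a leakage-rate measurement could look like, the sketch below counts how often a transcript contains a private word that was never spoken. It is not the paper's code or its metric definition: the `Trial` record, the single-word token matching, and the names `leakage_rate` and `accuracy` are all assumptions made for this example.

```python
import re
from dataclasses import dataclass


@dataclass
class Trial:
    """One test utterance (hypothetical record layout, not from the paper)."""
    spoken_word: str   # the word actually said in the audio
    private_word: str  # phonetically similar word from the prompt or training data
    transcript: str    # the model's output for this utterance


def _tokens(text):
    """Lower-case the text and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def leakage_rate(trials):
    """Fraction of trials whose transcript contains the unspoken private word.

    Assumption: each private word is a single token.
    """
    if not trials:
        return 0.0
    leaked = sum(1 for t in trials if t.private_word.lower() in _tokens(t.transcript))
    return leaked / len(trials)


def accuracy(trials):
    """Fraction of trials whose transcript contains the word that was spoken."""
    if not trials:
        return 0.0
    correct = sum(1 for t in trials if t.spoken_word.lower() in _tokens(t.transcript))
    return correct / len(trials)


# Made-up example data: the speaker says "Morrison", while "Morrissey"
# appears only in the context prompt or the fine-tuning data.
example_trials = [
    Trial("Morrison", "Morrissey", "Please call Mr Morrissey tomorrow."),
    Trial("Morrison", "Morrissey", "Please call Mr Morrison tomorrow."),
]
```

Computing both numbers for each customisation setup (prompting only, fine-tuning only, both) would give one point per setup on an accuracy-leakage plot, which is the kind of trade-off the abstract describes.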

