AS · CL · LG · Mar 27

PHONOS: PHOnetic Neutralization for Online Streaming Applications

arXiv:2603.27001 · 57.6 · h-index: 3
Predicted impact: top 59% in AS · last 90 days · Originality: Incremental advance
AI Analysis

For speaker anonymization systems, PHONOS reduces the risk of speaker identification via accent, a previously unaddressed bottleneck.

PHONOS addresses accent leakage in speaker anonymization by neutralizing non-native accents in real time, achieving an 81% reduction in non-native accent confidence with latency under 241 ms on a single GPU.

Speaker anonymization (SA) systems modify timbre while leaving regional or non-native accents intact, which is problematic because accents can narrow the anonymity set. To address this issue, we present PHONOS, a streaming module for real-time SA that neutralizes non-native accents so that speech sounds native-like. Our approach pre-generates golden speaker utterances that preserve source timbre and rhythm but replace foreign segmentals with native ones, using silence-aware DTW alignment and zero-shot voice conversion. These utterances supervise a causal accent translator that maps non-native content tokens to native equivalents with at most 40 ms of look-ahead, trained with joint cross-entropy and CTC losses. Our evaluations show an 81% reduction in non-native accent confidence, with listening-test ratings consistent with this shift. Speaker linkability also drops, as accent-neutralized utterances move away from the original speaker in embedding space, and latency stays under 241 ms on a single GPU.
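The alignment step mentioned in the abstract can be illustrated with a minimal sketch of plain dynamic time warping (DTW) over two sequences of frame-level feature vectors. This is not the paper's implementation: the abstract does not say how the silence-aware variant treats silent frames, so that part is omitted. The function name `dtw_align` and the toy one-dimensional features are assumptions made purely for illustration.

```python
import math


def dtw_align(src, ref):
    """Align two sequences of feature vectors with plain DTW.

    Hypothetical helper for illustration; returns (total_cost, path), where
    path is a list of (src_index, ref_index) pairs from start to end.
    """
    n, m = len(src), len(ref)
    # cost[i][j] = cheapest cumulative cost of aligning src[:i] with ref[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(src[i - 1], ref[j - 1])  # Euclidean frame distance
            cost[i][j] = d + min(
                cost[i - 1][j - 1],  # advance both sequences
                cost[i - 1][j],      # advance src only
                cost[i][j - 1],      # advance ref only
            )

    # Backtrack from the end, always stepping to the cheapest predecessor.
    i, j = n, m
    path = []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path


# Toy example: the second source frame is stretched over two reference frames.
source_frames = [[0.0], [1.0], [2.0]]
reference_frames = [[0.0], [1.0], [1.0], [2.0]]
total_cost, alignment = dtw_align(source_frames, reference_frames)
print(total_cost, alignment)
```

A silence-aware variant would presumably handle pauses differently from speech frames when building the cost matrix, but those details would have to come from the paper itself.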

