SD · CL · AS · Jul 15, 2023

Single and Multi-Speaker Cloned Voice Detection: From Perceptual to Learned Features

arXiv:2307.07683v2 · 23 citations · h-index: 7
Originality: Incremental advance
AI Analysis

The paper addresses the need for reliable cloned-voice detection to prevent harms such as fraud and disinformation, though its contribution is an incremental improvement to feature extraction methods.

The paper tackles the problem of detecting cloned voices that impersonate specific individuals, showing that learned features achieve an equal error rate between 0% and 4% and are reasonably robust to adversarial laundering.

Synthetic-voice cloning technologies have seen significant advances in recent years, giving rise to a range of potential harms. From small- and large-scale financial fraud to disinformation campaigns, the need for reliable methods to differentiate real and synthesized voices is imperative. We describe three techniques for differentiating a real from a cloned voice designed to impersonate a specific person. These three approaches differ in their feature extraction stage with low-dimensional perceptual features offering high interpretability but lower accuracy, to generic spectral features, and end-to-end learned features offering less interpretability but higher accuracy. We show the efficacy of these approaches when trained on a single speaker's voice and when trained on multiple voices. The learned features consistently yield an equal error rate between 0% and 4%, and are reasonably robust to adversarial laundering.
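The abstract reports results as an equal error rate (EER): the operating point at which the rate of real voices flagged as cloned equals the rate of cloned voices that are missed. The sketch below illustrates how that metric can be computed from detector scores. It is not the paper's evaluation code; the function name `equal_error_rate`, the score convention (higher means "more likely cloned"), and the label encoding (1 = cloned, 0 = real) are assumptions made for illustration.

```python
def equal_error_rate(scores, labels):
    """Approximate the EER of a binary detector.

    Assumed conventions (not taken from the paper):
      scores: higher value means "more likely cloned"
      labels: 1 = cloned voice, 0 = real voice
    """
    cloned = [s for s, y in zip(scores, labels) if y == 1]
    real = [s for s, y in zip(scores, labels) if y == 0]
    if not cloned or not real:
        raise ValueError("need at least one real and one cloned sample")

    best_gap, eer = None, None
    # Candidate thresholds: every observed score, plus one above the maximum
    # so that "flag nothing as cloned" is also considered.
    for t in sorted(set(scores)) + [max(scores) + 1.0]:
        # A sample is flagged as cloned when its score is >= t.
        false_accept = sum(s >= t for s in real) / len(real)      # real flagged as cloned
        false_reject = sum(s < t for s in cloned) / len(cloned)   # cloned missed
        gap = abs(false_accept - false_reject)
        # Keep the threshold where the two error rates are closest.
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (false_accept + false_reject) / 2
    return eer


# Perfectly separated scores give an EER of 0.
print(equal_error_rate([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1]))
```

With a finite set of scores the two error rates rarely cross exactly, so this sketch reports the average of the two rates at the threshold where they are closest. Interpolating along the error curve is a common alternative.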

Code Implementations: 1 repo