Audio Deepfake Detection in the Age of Advanced Text-to-Speech Models
This work addresses audio deepfake detection for security and verification applications. Its contribution is incremental: it evaluates and integrates existing detection methods rather than introducing a novel detection technique.
This paper tackles the problem of detecting audio deepfakes generated by advanced Text-to-Speech models. It finds that single-paradigm detectors vary widely in effectiveness across TTS architectures, whereas a multi-view approach combining semantic, structural, and signal-level analyses performs robustly across all evaluated models.
Recent advances in Text-to-Speech (TTS) systems have substantially increased the realism of synthetic speech, raising new challenges for audio deepfake detection. This work presents a comparative evaluation of three state-of-the-art TTS models (Dia2, Maya1, and MeloTTS) representing streaming, LLM-based, and non-autoregressive architectures. A corpus of 12,000 synthetic audio samples was generated using the Daily-Dialog dataset and evaluated against four detection frameworks spanning semantic, structural, and signal-level approaches. The results reveal significant variability in detector performance across generative mechanisms: detectors that are effective against one TTS architecture may fail against others, particularly against LLM-based synthesis. In contrast, a multi-view detection approach that combines complementary analysis levels performs robustly across all evaluated models. These findings highlight the limitations of single-paradigm detectors and emphasize the need for integrated detection strategies to address the evolving landscape of audio deepfake threats.
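The idea of combining complementary analysis levels and checking robustness per TTS architecture can be illustrated with a minimal sketch. This is not the paper's method: it assumes each view (semantic, structural, signal-level) already yields a fake-probability in [0, 1] and fuses them by a simple weighted average, which is only one of many possible fusion rules. The function names, the generator labels `tts_a` and `tts_b`, and all scores are hypothetical and invented for illustration.

```python
def fuse_scores(view_scores, weights=None):
    """Weighted average of per-view fake-probabilities for one clip.

    view_scores: dict mapping a view name to a score in [0, 1].
    weights: optional dict with one weight per view (assumed: equal weights).
    """
    if weights is None:
        weights = {name: 1.0 for name in view_scores}
    total = sum(weights[name] for name in view_scores)
    return sum(weights[name] * score for name, score in view_scores.items()) / total


def accuracy_by_generator(samples, threshold=0.5):
    """Accuracy of the fused decision, grouped by the generator of each clip.

    samples: list of (generator, is_fake, view_scores) tuples.
    """
    hits = {}
    for generator, is_fake, view_scores in samples:
        predicted_fake = fuse_scores(view_scores) >= threshold
        hits.setdefault(generator, []).append(predicted_fake == is_fake)
    return {gen: sum(h) / len(h) for gen, h in hits.items()}


# Toy, made-up scores; they are NOT results from the paper.
toy_samples = [
    ("tts_a", True, {"semantic": 0.8, "structural": 0.7, "signal": 0.9}),
    ("tts_a", False, {"semantic": 0.2, "structural": 0.1, "signal": 0.3}),
    ("tts_b", True, {"semantic": 0.2, "structural": 0.9, "signal": 0.7}),
    ("tts_b", True, {"semantic": 0.1, "structural": 0.4, "signal": 0.4}),
]

per_generator = accuracy_by_generator(toy_samples)
```

Grouping the metric by generator, rather than reporting one pooled number, is what would expose the kind of architecture-dependent failure the abstract describes. In the third toy clip the semantic view alone would miss the fake, while the fused score still flags it.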