SD · AI · AS · Jan 22

DeepASMR: LLM-Based Zero-Shot ASMR Speech Generation for Anyone of Any Voice

arXiv:2601.15596v1 · 1 citation · h-index: 7
Originality: Highly original
AI Analysis

The paper addresses zero-shot ASMR generation for relaxation applications and presents a novel method rather than an incremental improvement.

The paper tackles the generation of Autonomous Sensory Meridian Response (ASMR) speech, a specialized low-intensity style used for relaxation. It introduces DeepASMR, a framework that takes a single short snippet of a speaker's ordinary speech and synthesizes high-fidelity ASMR in that voice, without whispered training data from the target speaker. The authors report state-of-the-art naturalness and style fidelity.

While modern Text-to-Speech (TTS) systems achieve high fidelity for read-style speech, they struggle to generate Autonomous Sensory Meridian Response (ASMR), a specialized, low-intensity speech style essential for relaxation. The inherent challenges include ASMR's subtle, often unvoiced characteristics and the demand for zero-shot speaker adaptation. In this paper, we introduce DeepASMR, the first framework designed for zero-shot ASMR generation. We demonstrate that a single short snippet of a speaker's ordinary, read-style speech is sufficient to synthesize high-fidelity ASMR in their voice, eliminating the need for whispered training data from the target speaker. Methodologically, we first identify that discrete speech tokens provide a soft factorization of ASMR style from speaker timbre. Leveraging this insight, we propose a two-stage pipeline incorporating a Large Language Model (LLM) for content-style encoding and a flow-matching acoustic decoder for timbre reconstruction. Furthermore, we contribute DeepASMR-DB, a comprehensive 670-hour English-Chinese multi-speaker ASMR speech corpus, and introduce a novel evaluation protocol integrating objective metrics, human listening tests, LLM-based scoring and unvoiced speech analysis. Extensive experiments confirm that DeepASMR achieves state-of-the-art naturalness and style fidelity in ASMR generation for anyone of any voice, while maintaining competitive performance on normal speech synthesis.
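The abstract names a flow-matching acoustic decoder but gives no details here, so the following is only a generic sketch of how flow-matching sampling works, not the authors' implementation. It integrates a velocity field from noise (t = 0) towards data (t = 1) with fixed Euler steps. The names `toy_velocity` and `euler_sample` are hypothetical, and the velocity field is a closed-form stand-in that is handed the target directly. In a real decoder it would presumably be a learned network conditioned on the speech tokens and the speaker prompt.

```python
import random


def toy_velocity(x, t, target):
    """Hypothetical stand-in for a learned velocity network.

    For a straight-line path x_t = (1 - t) * x0 + t * x1, the velocity
    that carries the current point x to the target x1 is (x1 - x) / (1 - t).
    A trained model would predict this from conditioning inputs instead of
    being given the target.
    """
    return [(tg - xi) / (1.0 - t) for xi, tg in zip(x, target)]


def euler_sample(velocity, x0, target, num_steps=8):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so (1 - t) is never zero
        v = velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy "acoustic feature" vector standing in for a mel-spectrogram frame.
rng = random.Random(0)
target = [0.5, -1.0, 2.0]
noise = [rng.gauss(0.0, 1.0) for _ in target]
sample = euler_sample(toy_velocity, noise, target)
```

With this idealized velocity field the path is a straight line, so the sampler lands on the target regardless of the step count. A learned field is only approximate, which is why the number of integration steps matters in practice.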

