Yixin Tian

SD
3 papers
21 citations
Novelty 47%
AI Score 34

3 Papers

SD Sep 27, 2023
High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models

Chunyu Qiang, Hao Li, Yixin Tian et al.

Text-to-speech (TTS) methods have shown promising results in voice cloning, but they require a large number of labeled text-speech pairs. Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations (semantic and acoustic) and using two sequence-to-sequence tasks to enable training with minimal supervision. However, existing methods suffer from information redundancy and dimension explosion in the semantic representation, and from high-frequency waveform distortion in the discrete acoustic representation. Autoregressive frameworks exhibit typical instability and uncontrollability issues, and non-autoregressive frameworks suffer from prosodic averaging caused by duration prediction models. To address these issues, we propose a minimally-supervised high-fidelity speech synthesis method in which all modules are built on diffusion models. The non-autoregressive framework enhances controllability, and the duration diffusion model enables diversified prosodic expression. Contrastive Token-Acoustic Pretraining (CTAP) is used as an intermediate semantic representation to solve the problems of information redundancy and dimension explosion in existing semantic coding methods. The mel-spectrogram is used as the acoustic representation. Both semantic and acoustic representations are predicted by continuous variable regression tasks to solve the problem of high-frequency fine-grained waveform distortion. Experimental results show that our proposed method outperforms the baseline method. We provide audio samples on our website.

SY Nov 13, 2025
Optimized Design of the Generalized Bilinear Transformation for Discretizing Analog Systems

Shen Chen, Yanlong Li, Jiamin Cui et al.

A common approach to digital system design involves transforming a continuous-time (s-domain) transfer function into the discrete-time (z-domain) using methods such as Euler or Tustin. These transformations are shown to be specific cases of the Generalized Bilinear Transformation (GBT), characterized by a design parameter, $α$, whose physical interpretation and optimal selection remain inadequately explored. In this paper, we propose an alternative derivation of the GBT that employs a new hexagonal shape to approximate the enclosed area of the error function, and we define the parameter $α$ as a shape factor. We reveal, for the first time, the physical meaning of $α$ as the backward rectangular ratio of the proposed hexagonal shape. Through domain mapping, the stable range of $α$ is rigorously established to be [0.5, 1]. Depending on the operating frequency and the chosen $α$, we observe two distinct distortion modes, i.e., magnitude distortion and phase distortion. We further develop an optimal design method for $α$ by minimizing a normalized magnitude or phase error objective function. The effectiveness of the proposed method is validated through the design and testing of a low-pass filter (LPF), demonstrating strong agreement between theoretical predictions and experimental results.
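The stability claim in this abstract can be illustrated with a small sketch. It assumes the commonly used GBT parametrization $s = (z - 1) / (T(αz + 1 - α))$, in which $α = 0$ gives forward Euler, $α = 0.5$ gives Tustin, and $α = 1$ gives backward Euler. It does not reproduce the paper's hexagonal derivation or its optimal-design procedure. The function names, the example pole, and the sampling period `T` are hypothetical values chosen for illustration.

```python
def gbt_map_pole(s: complex, T: float, alpha: float) -> complex:
    """Map a continuous-time pole s to the z-plane.

    Assumes the GBT form s = (z - 1) / (T * (alpha * z + 1 - alpha)),
    solved for z. This parametrization is an assumption, not taken
    from the abstract itself.
    """
    return (1 + (1 - alpha) * s * T) / (1 - alpha * s * T)


def is_stable(z: complex) -> bool:
    """A discrete-time pole is stable when it lies strictly inside the unit circle."""
    return abs(z) < 1.0


# Hypothetical example: a stable but lightly damped continuous-time pole,
# sampled fairly slowly so that the choice of alpha matters.
s_pole = complex(-1.0, 50.0)
T = 0.1

for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
    z = gbt_map_pole(s_pole, T, alpha)
    print(f"alpha={alpha:.2f}  |z|={abs(z):.3f}  stable={is_stable(z)}")
```

Under this parametrization, the stable pole stays inside the unit circle for the sampled values of $α$ in [0.5, 1] and leaves it for the smaller values, which is consistent with the stable range stated in the abstract.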

AS Sep 1, 2023
Learning Speech Representation From Contrastive Token-Acoustic Pretraining

Chunyu Qiang, Hao Li, Yixin Tian et al.

For fine-grained generation and recognition tasks such as minimally-supervised text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), the intermediate representations extracted from speech should serve as a "bridge" between text and acoustic information, containing information from both modalities. The semantic content should be emphasized, while paralinguistic information such as speaker identity and acoustic details should be de-emphasized. However, existing methods for extracting fine-grained intermediate representations from speech suffer from excessive redundancy and dimension explosion. Contrastive learning is well suited to modeling intermediate representations from two modalities. However, existing contrastive learning methods in the audio field focus on extracting global descriptive information for downstream audio classification tasks, making them unsuitable for TTS, VC, and ASR tasks. To address these issues, we propose a method named "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space, learning how to connect phoneme and speech at the frame level. The CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR. The proposed CTAP method offers a promising solution for fine-grained generation and recognition downstream tasks in speech processing. We provide a website with audio samples.
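The idea of pulling paired phoneme and speech frames together in a joint space can be sketched with a symmetric InfoNCE-style contrastive loss. The abstract does not state the loss that CTAP uses, so this CLIP-style formulation is an assumption. The function names, the temperature value, and the toy embeddings are likewise hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_info_nce(phoneme_emb, speech_emb, temperature=0.07):
    """Symmetric contrastive loss over N paired frame embeddings.

    Row i of phoneme_emb is assumed to be the positive match of row i of
    speech_emb; every other row in the batch acts as a negative.
    """
    p = [l2_normalize(v) for v in phoneme_emb]
    s = [l2_normalize(v) for v in speech_emb]
    n = len(p)
    # Pairwise cosine similarities, scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(p[i], s[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Phoneme-to-speech direction: classify the matching column in each row.
    loss_p2s = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Speech-to-phoneme direction: classify the matching row in each column.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_s2p = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (loss_p2s + loss_s2p)


# Toy frame embeddings (hypothetical): matched pairs vs. swapped pairs.
phoneme_frames = [[1.0, 0.0], [0.0, 1.0]]
aligned_speech = [[0.9, 0.1], [0.1, 0.9]]
swapped_speech = [[0.1, 0.9], [0.9, 0.1]]

print("aligned loss:", symmetric_info_nce(phoneme_frames, aligned_speech))
print("swapped loss:", symmetric_info_nce(phoneme_frames, swapped_speech))
```

In this toy setup the loss is small when each phoneme frame is closest to its own speech frame and large when the pairs are swapped, which is the behavior a frame-level contrastive objective is meant to reward.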