AS May 21
OneVoice: One Model, Triple Scenarios-Towards Unified Zero-shot Voice Conversion
Zhichao Wang, Tao Li, Wenshuo Ge et al.
Recent progress in voice conversion (VC) has reached a new milestone in speaker cloning and linguistic preservation. However, the field remains fragmented, relying on specialized models for linguistic-preserving, expressive, and singing scenarios. We propose OneVoice, a unified zero-shot framework capable of handling all three scenarios within a single model. OneVoice is built upon a continuous language model trained with VAE-free next-patch diffusion, ensuring high fidelity and efficient sequence modeling. Its core design for unification is a Mixture-of-Experts (MoE) that explicitly models shared conversion knowledge and scenario-specific expressivity. Expert selection is coordinated by a dual-path routing mechanism, including shared expert isolation and scenario-aware domain expert assignment with global-local cues. For precise conditioning, scenario-specific prosodic features are fused into each layer via a gated mechanism, allowing adaptive usage of prosody information. Furthermore, to enable the core idea and alleviate the data imbalance (abundant speech vs. scarce singing), we adopt a two-stage progressive training scheme that includes foundational pre-training and scenario enhancement with LoRA-based domain experts. Experiments show that OneVoice matches or surpasses specialized models across all three scenarios, while verifying flexible control over scenarios and offering a fast decoding version that uses as few as 2 steps. Audio samples are available on the demo page.
AS Jun 12, 2023
MFSN: Multi-perspective Fusion Search Network For Pre-training Knowledge in Speech Emotion Recognition
Haiyang Sun, Fulin Zhang, Yingying Gao et al.
Speech Emotion Recognition (SER) is an important research topic in human-computer interaction. Many recent works focus on directly extracting emotional cues through pre-trained knowledge, frequently overlooking considerations of appropriateness and comprehensiveness. Therefore, we propose a novel framework for pre-training knowledge in SER, called Multi-perspective Fusion Search Network (MFSN). Considering comprehensiveness, we partition speech knowledge into Textual-related Emotional Content (TEC) and Speech-related Emotional Content (SEC), capturing cues from both semantic and acoustic perspectives, and we design a new architecture search space to fully leverage them. Considering appropriateness, we verify the efficacy of different modeling approaches in capturing SEC and fill the gap in current research. Experimental results on multiple datasets demonstrate the superiority of MFSN.
AS Jun 26, 2022
Meta Auxiliary Learning for Low-resource Spoken Language Understanding
Yingying Gao, Junlan Feng, Chao Deng et al.
Spoken language understanding (SLU) treats automatic speech recognition (ASR) and natural language understanding (NLU) as a unified task and usually suffers from data scarcity. We exploit an ASR and NLU joint training method based on meta auxiliary learning to improve the performance of low-resource SLU tasks by taking advantage only of abundant manual transcriptions of speech data. One obvious advantage of such a method is that it provides a flexible framework to implement a low-resource SLU training task without requiring access to any further semantic annotations. In particular, an NLU model is taken as the label generation network to predict intent and slot tags from texts; a multi-task network trains the ASR task and the SLU task synchronously from speech; and the predictions of the label generation network are delivered to the multi-task network as semantic targets. The efficiency of the proposed algorithm is demonstrated with experiments on the public CATSLU dataset, where it produces more suitable ASR hypotheses for the downstream NLU task.
LG Oct 23, 2023
Cascaded Multi-task Adaptive Learning Based on Neural Architecture Search
Yingying Gao, Shilei Zhang, Zihao Cui et al.
Cascading multiple pre-trained models is an effective way to compose an end-to-end system. However, fine-tuning the full cascaded model is parameter- and memory-inefficient, and our observations reveal that applying only adapter modules to the cascaded model cannot achieve performance comparable to fine-tuning. We propose an automatic and effective adaptive learning method to optimize end-to-end cascaded multi-task models based on a Neural Architecture Search (NAS) framework. The candidate adaptive operations on each specific module consist of freezing, inserting an adapter, and fine-tuning. We further add a penalty term to the loss that takes the number of trainable parameters into account in order to limit the learned structure. The penalty term successfully restricts the searched architecture, and the proposed approach is able to find a tuning scheme similar to the hand-crafted one, reducing the optimized parameters to 8.7% of those in full fine-tuning on SLURP with even better performance.
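The abstract does not spell out the form of the penalty, so the following is only a minimal sketch under stated assumptions: a differentiable-NAS style relaxation in which each module holds three architecture logits (freeze, adapter, fine-tune), and the penalty is the expected fraction of trainable parameters under the softmax of those logits. The function names, the `weight` coefficient, and the parameter counts are hypothetical and not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_trainable_fraction(arch_logits, param_counts):
    """Expected share of trainable parameters across all modules.

    arch_logits:  per-module logits for (freeze, adapter, fine-tune).
    param_counts: per-module trainable parameters for the same three
                  options, e.g. (0, adapter_params, full_params).
    Both layouts are assumptions made for this illustration.
    """
    expected = 0.0
    full = 0
    for logits, counts in zip(arch_logits, param_counts):
        probs = softmax(logits)
        expected += sum(p * c for p, c in zip(probs, counts))
        full += counts[2]  # cost of fully fine-tuning this module
    return expected / full


def penalized_loss(task_loss, arch_logits, param_counts, weight=1.0):
    """Task loss plus a penalty that grows with trainable parameters."""
    return task_loss + weight * expected_trainable_fraction(arch_logits, param_counts)


# Two hypothetical modules: one undecided, one leaning towards freezing.
arch_logits = [[0.0, 0.0, 0.0], [4.0, 0.0, -4.0]]
param_counts = [(0, 10, 100), (0, 20, 200)]
print(penalized_loss(1.0, arch_logits, param_counts, weight=0.5))
```

With this form, pushing a module's logits towards "freeze" lowers the penalty, so the search keeps a module trainable only when the task loss benefits enough to offset the cost.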
SD Apr 10, 2021 (Code)
Boundary and Context Aware Training for CIF-based Non-Autoregressive End-to-end ASR
Fan Yu, Haoneng Luo, Pengcheng Guo et al.
Continuous integrate-and-fire (CIF) based models, which use a soft and monotonic alignment mechanism, have been well applied in non-autoregressive (NAR) speech recognition with competitive performance compared with other NAR methods. However, such an alignment learning strategy may suffer from erroneous acoustic boundary estimation, severely hindering the convergence speed as well as the system performance. In this paper, we propose a boundary and context aware training approach for CIF-based NAR models. Firstly, the connectionist temporal classification (CTC) spike information is utilized to guide the learning of acoustic boundaries in the CIF. Besides, an additional contextual decoder is introduced behind the CIF decoder, aiming to capture the linguistic dependencies within a sentence. Finally, we adopt the recently proposed Conformer architecture to improve the capacity of acoustic modeling. Experiments on the open-source Mandarin AISHELL-1 corpus show that the proposed method achieves a comparable character error rate (CER) of 4.9% with only 1/24 of the latency of a state-of-the-art autoregressive (AR) Conformer model. Furthermore, when evaluated on an internal 7500-hour Mandarin corpus, our model still outperforms other NAR methods and even matches the AR Conformer model on a challenging real-world noisy test set.
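For readers unfamiliar with the alignment mechanism mentioned above, here is a minimal sketch of the basic CIF accumulation step as it is commonly described: per-frame weights are summed until they reach a threshold, at which point an integrated vector is emitted ("fired") and the leftover weight starts the next unit. It assumes each weight lies in [0, 1] so that a frame fires at most once, and it does not include the CTC-guided boundary training or the contextual decoder proposed in the paper. The function name `cif` and the toy inputs are illustrative.

```python
def cif(frames, alphas, threshold=1.0):
    """Integrate weighted frames and fire one output per unit of weight.

    frames: list of equal-length feature vectors (lists of floats).
    alphas: per-frame weights, each assumed to be in [0, 1].
    Returns the list of fired (integrated) vectors.
    """
    dim = len(frames[0])
    outputs = []
    acc_weight = 0.0          # weight accumulated for the current unit
    acc_state = [0.0] * dim   # weighted sum of frames for the current unit
    for frame, alpha in zip(frames, alphas):
        if acc_weight + alpha < threshold:
            # No boundary yet: keep integrating.
            acc_weight += alpha
            acc_state = [s + alpha * x for s, x in zip(acc_state, frame)]
        else:
            # Boundary inside this frame: use just enough weight to fire,
            # then carry the remainder over to the next unit.
            used = threshold - acc_weight
            outputs.append([s + used * x for s, x in zip(acc_state, frame)])
            rest = alpha - used
            acc_weight = rest
            acc_state = [rest * x for x in frame]
    return outputs


# Toy example with one-dimensional frames; two units are fired.
print(cif([[1.0], [3.0], [2.0], [4.0]], [0.5, 0.5, 0.25, 0.75]))
```

Because the number of fired vectors depends on where the accumulated weight crosses the threshold, errors in the predicted weights shift the acoustic boundaries, which is the issue the boundary-aware training targets.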
AS Mar 22
SqueezeComposer: Temporal Speed-up is A Simple Trick for Long-form Music Composing
Jianyi Chen, Rongxiu Zhong, Shilei Zhang et al.
Composing coherent long-form music remains a significant challenge due to the complexity of modeling long-range dependencies and the prohibitive memory and computational requirements associated with lengthy audio representations. In this work, we propose a simple yet powerful trick: we assume that AI models can understand and generate time-accelerated (speeded-up) audio at rates such as 2x, 4x, or even 8x. By first generating a high-speed version of the music, we greatly reduce the temporal length and resource requirements, making it feasible to handle long-form music that would otherwise exceed memory or computational limits. The generated audio is then restored to its original speed, recovering the full temporal structure. This temporal speed-up and slow-down strategy naturally follows the principle of hierarchical generation from abstract to detailed content, and can be conveniently applied to existing music generation models to enable long-form music generation. We instantiate this idea in SqueezeComposer, a framework that employs diffusion models for generation in the accelerated domain and refinement in the restored domain. We validate the effectiveness of this approach on two tasks: long-form music generation, which evaluates temporal-wise control (including continuation, completion, and generation from scratch), and whole-song singing accompaniment generation, which evaluates track-wise control. Experimental results demonstrate that our simple temporal speed-up trick enables efficient, scalable, and high-quality long-form music generation. Audio samples are available at https://SqueezeComposer.github.io/.
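To make the resource argument concrete, the following back-of-the-envelope sketch computes how the sequence length shrinks with the speed-up factor. The 50 frames-per-second latent rate and the 240-second song are hypothetical values chosen for illustration, and the quadratic cost estimate only applies to models with full self-attention; neither is stated in the abstract.

```python
import math


def latent_frames(duration_s, frames_per_second, speedup):
    """Number of latent frames needed for audio played `speedup` times faster."""
    return math.ceil(duration_s * frames_per_second / speedup)


def attention_pairs(num_frames):
    """Pairwise interactions in full self-attention (quadratic in length)."""
    return num_frames * num_frames


duration_s = 240          # hypothetical four-minute song
frames_per_second = 50    # hypothetical latent frame rate

baseline = latent_frames(duration_s, frames_per_second, 1)
for speedup in (1, 2, 4, 8):
    frames = latent_frames(duration_s, frames_per_second, speedup)
    saving = attention_pairs(baseline) / attention_pairs(frames)
    print(f"{speedup}x: {frames} frames, {saving:.0f}x fewer attention pairs")
```

Under these assumptions, an 8x speed-up cuts the sequence from 12000 to 1500 frames, which is why generation in the accelerated domain followed by restoration can fit songs that would otherwise exceed memory limits.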
SD Dec 4, 2024
DiffStyleTTS: Diffusion-based Hierarchical Prosody Modeling for Text-to-Speech with Diverse and Controllable Styles
Jiaxuan Liu, Zhaoci Liu, Yajun Hu et al.
Human speech exhibits rich and flexible prosodic variations. To address the one-to-many mapping problem from text to prosody in a reasonable and flexible manner, we propose DiffStyleTTS, a multi-speaker acoustic model based on a conditional diffusion module and an improved classifier-free guidance, which hierarchically models speech prosodic features, and controls different prosodic styles to guide prosody prediction. Experiments show that our method outperforms all baselines in naturalness and achieves superior synthesis speed compared to three diffusion-based baselines. Additionally, by adjusting the guiding scale, DiffStyleTTS effectively controls the guidance intensity of the synthetic prosody.
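The role of the guiding scale can be illustrated with the standard classifier-free guidance rule, which blends a conditional and an unconditional prediction. This is a sketch of the generic technique only: the paper describes an improved variant whose details are not given in the abstract, and the function name and toy vectors below are illustrative.

```python
def guided_prediction(cond, uncond, scale):
    """Standard classifier-free guidance: uncond + scale * (cond - uncond).

    cond / uncond: the model's predictions with and without the condition
    (for example, the predicted noise at one diffusion step).
    scale = 0 ignores the condition, scale = 1 uses the conditional
    prediction as is, and scale > 1 extrapolates past it.
    """
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


cond = [1.0, 2.0]      # toy conditional prediction
uncond = [0.0, 1.0]    # toy unconditional prediction
for scale in (0.0, 1.0, 2.0):
    print(scale, guided_prediction(cond, uncond, scale))
```

Some papers write the same idea as `(1 + w) * cond - w * uncond`; the two forms are equivalent with `scale = 1 + w`.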
AS Sep 23, 2025
Teaching Audio Models to Reason: A Unified Framework for Source- and Layer-wise Distillation
Runyan Yang, Yuke Si, Yingying Gao et al.
While large audio language models excel at tasks like ASR and emotion recognition, they still struggle with complex reasoning due to the modality gap between audio and text as well as the lack of structured intermediate supervision. To address this, we propose a unified knowledge distillation framework to transfer reasoning capabilities from a high-capacity textual teacher model to a student audio model while preserving its acoustic competence. Our method introduces two key dimensions: source-wise distillation, which leverages both textual and acoustic teachers to provide complementary modality-specific supervision; and layer-wise distillation, which aligns teacher signals with appropriate student layers to improve transfer efficiency. This dual-dimensional strategy enables fine-grained control over the distillation process, effectively bridging the gap between symbolic reasoning and speech representations. Experimental results show significant improvements in audio reasoning performance, demonstrating the effectiveness of our framework as a reasoning transfer solution for audio modeling.
AS Sep 23, 2025
HarmoniFuse: A Component-Selective and Prompt-Adaptive Framework for Multi-Task Speech Language Modeling
Yuke Si, Runyan Yang, Yingying Gao et al.
Recent advances in large language models have facilitated the development of unified speech language models (SLMs) capable of supporting multiple speech tasks within a shared architecture. However, tasks such as automatic speech recognition (ASR) and speech emotion recognition (SER) rely on distinct types of information: ASR primarily depends on linguistic content, whereas SER requires the integration of both linguistic and paralinguistic cues. Existing multitask SLMs typically adopt naive parameter sharing or prompt-based conditioning without explicitly modeling the differences in information composition required by each task. Such designs risk task interference and performance degradation, especially under limited data conditions. To address these limitations, we propose HarmoniFuse, a component-selective and prompt-adaptive framework for multi-task speech language modeling. HarmoniFuse is designed to harmonize heterogeneous task demands by selecting and fusing task-relevant components of speech representations. Specifically, it integrates a gated speech encoder to extract task-specific acoustic features and a prompt-adaptive dynamic fusion module to aggregate transformer layers based on task characteristics. In addition, a batch-interleaved training strategy enables leveraging separate ASR and SER datasets without requiring joint annotation. Experimental results demonstrate that HarmoniFuse improves both ASR and SER performance, offering a scalable and robust solution for multitask speech understanding under realistic data constraints.
LG May 15, 2025
UDDETTS: Unifying Discrete and Dimensional Emotions for Controllable Emotional Text-to-Speech
Jiaxuan Liu, Yang Xiang, Han Zhao et al.
Recent large language models (LLMs) have made great progress in the field of text-to-speech (TTS), but they still face major challenges in synthesizing fine-grained emotional speech in an interpretable manner. Traditional methods rely on discrete emotion labels to control emotion categories and intensities, which cannot capture the complexity and continuity of human emotional perception and expression. The lack of large-scale emotional speech datasets with balanced emotion distributions and fine-grained emotional annotations often causes overfitting in synthesis models and impedes effective emotion control. To address these issues, we propose UDDETTS, a universal LLM framework unifying discrete and dimensional emotions for controllable emotional TTS. This model introduces the interpretable Arousal-Dominance-Valence (ADV) space for dimensional emotion description and supports emotion control driven by either discrete emotion labels or nonlinearly quantified ADV values. Furthermore, a semi-supervised training strategy is designed to comprehensively utilize diverse speech datasets with different types of emotional annotations to train UDDETTS. Experiments show that UDDETTS achieves linear emotion control along three interpretable dimensions and exhibits superior end-to-end emotional speech synthesis capabilities. Code and demos are available at: https://anonymous.4open.science/w/UDDETTS.
CL Jun 26, 2024
Exploring Energy-Based Models for Out-of-Distribution Detection in Dialect Identification
Yaqian Hao, Chenguang Hu, Yingying Gao et al.
The diverse nature of dialects presents challenges for models trained on specific linguistic patterns, rendering them susceptible to errors when confronted with unseen or out-of-distribution (OOD) data. This study introduces a novel margin-enhanced joint energy model (MEJEM) tailored specifically for OOD detection in dialects. By integrating a generative model and the energy margin loss, our approach aims to enhance the robustness of dialect identification systems. Furthermore, we explore two OOD scores for OOD dialect detection, and our findings conclusively demonstrate that the energy score outperforms the softmax score. Leveraging Sharpness-Aware Minimization to optimize the training process of the joint model, we enhance model generalization by minimizing both loss and sharpness. Experiments conducted on dialect identification tasks validate the efficacy of Energy-Based Models and provide valuable insights into their performance.
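The two OOD scores compared above can be sketched as follows. The abstract does not give the formulas, so this uses the common energy-based OOD formulation, in which the score is the temperature-scaled log-sum-exp of the classifier logits, alongside the maximum softmax probability. The function names and example logits are illustrative assumptions.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def energy_score(logits, temperature=1.0):
    """Negative free energy, T * logsumexp(logits / T).

    Higher values suggest in-distribution input; a threshold on this
    score flags low-scoring inputs as OOD.
    """
    return temperature * logsumexp([z / temperature for z in logits])


def softmax_score(logits):
    """Maximum softmax probability, the usual softmax-based OOD score."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    return max(exps) / sum(exps)


# Two logit vectors that differ only by a constant shift of 10.
weak = [5.0, 0.0, 0.0]
strong = [15.0, 10.0, 10.0]
print(softmax_score(weak), softmax_score(strong))   # identical softmax scores
print(energy_score(weak), energy_score(strong))     # energies differ by the shift
```

The example shows one reason the two scores can behave differently: softmax is invariant to a constant shift of the logits, whereas the energy score retains that overall magnitude.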
CL Jun 12, 2024
PolySpeech: Exploring Unified Multitask Speech Models for Competitiveness with Single-task Models
Runyan Yang, Huibao Yang, Xiqing Zhang et al.
Recently, there have been attempts to integrate various speech processing tasks into a unified model. However, few previous works have directly demonstrated that joint optimization of diverse tasks in multitask speech models has a positive influence on the performance of individual tasks. In this paper we present a multitask speech model, PolySpeech, which supports speech recognition, speech synthesis, and two speech classification tasks. PolySpeech takes a multi-modal language model as its core structure and uses semantic representations as speech inputs. We introduce semantic speech embedding tokenization and speech reconstruction methods to PolySpeech, enabling efficient generation of high-quality speech for any given speaker. PolySpeech shows competitiveness across various tasks compared to single-task models. In our experiments, multitask optimization achieves performance comparable to single-task optimization and is especially beneficial for specific tasks.
AS Jun 12, 2024
GenDistiller: Distilling Pre-trained Language Models based on an Autoregressive Generative Model
Yingying Gao, Shilei Zhang, Chao Deng et al.
Pre-trained speech language models such as HuBERT and WavLM leverage unlabeled speech data for self-supervised learning and offer powerful representations for numerous downstream tasks. Despite the success of these models, their high requirements for memory and computing resources hinder their application on resource-restricted devices. Therefore, this paper introduces GenDistiller, a novel knowledge distillation framework which generates the hidden representations of the pre-trained teacher model directly by a much smaller student network. The proposed method takes the previous hidden layer as history and implements a layer-by-layer prediction of the teacher model autoregressively. Experiments on SUPERB reveal the advantage of GenDistiller over the baseline distilling method without an autoregressive framework, with 33% fewer parameters, similar time consumption and better performance on most of the SUPERB tasks. Ultimately, the proposed GenDistiller reduces the size of WavLM by 82%.
AS Jan 30, 2022
HGCN: Harmonic gated compensation network for speech enhancement
Tianrui Wang, Weibin Zhu, Yingying Gao et al.
Mask processing in the time-frequency (T-F) domain through a neural network has been one of the mainstream approaches to single-channel speech enhancement. However, it is hard for most models to handle the situation when harmonics are partially masked by noise. To tackle this challenge, we propose a harmonic gated compensation network (HGCN). We design a high-resolution harmonic integral spectrum to improve the accuracy of harmonic location prediction. Then we add voice activity detection (VAD) and voiced region detection (VRD) to the convolutional recurrent network (CRN) to filter harmonic locations. Finally, the harmonic gating mechanism is used to guide the compensation model to adjust the coarse results from the CRN to obtain the refined enhanced results. Our experiments show that HGCN achieves substantial gains over a number of advanced approaches in the community.
CV Dec 11, 2018
Identity-Enhanced Network for Facial Expression Recognition
Yanwei Li, Xingang Wang, Shilei Zhang et al.
Facial expression recognition is a challenging task, arguably because of large intra-class variations and high inter-class similarities. The core drawback of the existing approaches is their inability to distinguish changes in appearance caused by emotions from those caused by identities. In this paper, we present a novel identity-enhanced network (IDEnNet) to eliminate the negative impact of the identity factor and focus on recognizing facial expressions. Spatial fusion combined with self-constrained multi-task learning is adopted to jointly learn the expression representations and identity-related information. We evaluate our approach on three popular datasets, namely Oulu-CASIA, CK+ and MMI. IDEnNet consistently improves on the baseline and achieves results that are the best or comparable to the state of the art on all three datasets.
CV Jan 5, 2017
Autoencoder Regularized Network For Driving Style Representation Learning
Weishan Dong, Ting Yuan, Kai Yang et al.
In this paper, we study learning generalized driving style representations from automobile GPS trip data. We propose a novel Autoencoder Regularized deep neural Network (ARNet) and a trip encoding framework, trip2vec, to learn drivers' driving styles directly from GPS records by combining supervised and unsupervised feature learning in a unified architecture. Experiments on a challenging driver number estimation problem and the driver identification problem show that ARNet can learn a good generalized driving style representation: it significantly outperforms existing methods and alternative architectures, reaching the lowest average estimation error (0.68, less than one driver) and the highest identification accuracy (an improvement of at least 3%) compared with traditional supervised learning methods.