SDJul 4, 2024Code
FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMsKeyu An, Qian Chen, Chong Deng et al.
This report introduces FunAudioLLM, a model family designed to enhance natural voice interactions between humans and large language models (LLMs). At its core are two innovative models: SenseVoice, which handles multilingual speech recognition, emotion recognition, and audio event detection; and CosyVoice, which facilitates natural speech generation with control over multiple languages, timbre, speaking style, and speaker identity. SenseVoice-Small delivers exceptionally low-latency ASR for 5 languages, and SenseVoice-Large supports high-precision ASR for over 50 languages, while CosyVoice excels in multi-lingual voice generation, zero-shot in-context learning, cross-lingual voice cloning, and instruction-following capabilities. The models related to SenseVoice and CosyVoice have been open-sourced on Modelscope and Huggingface, along with the corresponding training, inference, and fine-tuning codes released on GitHub. By integrating these models with LLMs, FunAudioLLM enables applications such as speech-to-speech translation, emotional voice chat, interactive podcasts, and expressive audiobook narration, thereby pushing the boundaries of voice interaction technology. Demos are available at https://fun-audio-llm.github.io, and the code can be accessed at https://github.com/FunAudioLLM.
SDAug 7, 2023Code
SeACo-Paraformer: A Non-Autoregressive ASR System with Flexible and Effective Hotword Customization AbilityXian Shi, Yexin Yang, Zerui Li et al.
Hotword customization is one of the concerned issues remained in ASR field - it is of value to enable users of ASR systems to customize names of entities, persons and other phrases to obtain better experience. The past few years have seen effective modeling strategies for ASR contextualization developed, but they still exhibit space for improvement about training stability and the invisible activation process. In this paper we propose Semantic-Augmented Contextual-Paraformer (SeACo-Paraformer) a novel NAR based ASR system with flexible and effective hotword customization ability. It possesses the advantages of AED-based model's accuracy, NAR model's efficiency, and explicit customization capacity of superior performance. Through extensive experiments with 50,000 hours of industrial big data, our proposed model outperforms strong baselines in customization. Besides, we explore an efficient way to filter large-scale incoming hotwords for further improvement. The industrial models compared, source codes and two hotword test sets are all open source.
CVMay 2, 2022
Open-Set Semi-Supervised Learning for 3D Point Cloud UnderstandingXian Shi, Xun Xu, Wanyue Zhang et al. · cambridge
Semantic understanding of 3D point cloud relies on learning models with massively annotated data, which, in many cases, are expensive or difficult to collect. This has led to an emerging research interest in semi-supervised learning (SSL) for 3D point cloud. It is commonly assumed in SSL that the unlabeled data are drawn from the same distribution as that of the labeled ones; This assumption, however, rarely holds true in realistic environments. Blindly using out-of-distribution (OOD) unlabeled data could harm SSL performance. In this work, we propose to selectively utilize unlabeled data through sample weighting, so that only conducive unlabeled data would be prioritized. To estimate the weights, we adopt a bi-level optimization framework which iteratively optimizes a metaobjective on a held-out validation set and a task-objective on a training set. Faced with the instability of efficient bi-level optimizers, we further propose three regularization techniques to enhance the training stability. Extensive experiments on 3D point cloud classification and segmentation tasks verify the effectiveness of our proposed method. We also demonstrate the feasibility of a more efficient training strategy.
CLJan 29Code
Qwen3-ASR Technical ReportXian Shi, Xiong Wang, Zhifang Guo et al.
In this report, we introduce Qwen3-ASR family, which includes two powerful all-in-one speech recognition models and a novel non-autoregressive speech forced alignment model. Qwen3-ASR-1.7B and Qwen3-ASR-0.6B are ASR models that support language identification and ASR for 52 languages and dialects. Both of them leverage large-scale speech training data and the strong audio understanding ability of their foundation model Qwen3-Omni. We conduct comprehensive internal evaluation besides the open-sourced benchmarks as ASR models might differ little on open-sourced benchmark scores but exhibit significant quality differences in real-world scenarios. The experiments reveal that the 1.7B version achieves SOTA performance among open-sourced ASR models and is competitive with the strongest proprietary APIs while the 0.6B version offers the best accuracy-efficiency trade-off. Qwen3-ASR-0.6B can achieve an average TTFT as low as 92ms and transcribe 2000 seconds speech in 1 second at a concurrency of 128. Qwen3-ForcedAligner-0.6B is an LLM based NAR timestamp predictor that is able to align text-speech pairs in 11 languages. Timestamp accuracy experiments show that the proposed model outperforms the three strongest force alignment models and takes more advantages in efficiency and versatility. To further accelerate the community research of ASR and audio understanding, we release these models under the Apache 2.0 license.
SDJan 29, 2023
Achieving Timestamp Prediction While Recognizing with Non-Autoregressive End-to-End ASR ModelXian Shi, Yanni Chen, Shiliang Zhang et al.
Conventional ASR systems use frame-level phoneme posterior to conduct force-alignment~(FA) and provide timestamps, while end-to-end ASR systems especially AED based ones are short of such ability. This paper proposes to perform timestamp prediction~(TP) while recognizing by utilizing continuous integrate-and-fire~(CIF) mechanism in non-autoregressive ASR model - Paraformer. Foucing on the fire place bias issue of CIF, we conduct post-processing strategies including fire-delay and silence insertion. Besides, we propose to use scaled-CIF to smooth the weights of CIF output, which is proved beneficial for both ASR and TP task. Accumulated averaging shift~(AAS) and diarization error rate~(DER) are adopted to measure the quality of timestamps and we compare these metrics of proposed system and conventional hybrid force-alignment system. The experiment results over manually-marked timestamps testset show that the proposed optimization methods significantly improve the accuracy of CIF timestamps, reducing 66.7\% and 82.1\% of AAS and DER respectively. Comparing to Kaldi force-alignment trained with the same data, optimized CIF timestamps achieved 12.3\% relative AAS reduction.
99.4SDMar 25Code
Semantic-Aware Interruption Detection in Spoken Dialogue Systems: Benchmark, Metric, and ModelKangxiang Xia, Bingshen Mu, Xian Shi et al.
Achieving natural full-duplex interaction in spoken dialogue systems (SDS) remains a challenge due to the difficulty of accurately detecting user interruptions. Current solutions are polarized between "trigger-happy" VAD-based methods that misinterpret backchannels and robust end-to-end models that exhibit unacceptable response delays. Moreover, the absence of real-world benchmarks and holistic metrics hinders progress in the field. This paper presents a comprehensive frame-work to overcome these limitations. We first introduce SID-Bench, the first benchmark for semantic-aware interruption detection built entirely from real-world human dialogues. To provide a rigorous assessment of the responsiveness-robustness trade-off, we propose the Average Penalty Time (APT) metric, which assigns a temporal cost to both false alarms and late responses. Building on this framework, we design an LLM-based detection model optimized through a novel training paradigm to capture subtle semantic cues of intent. Experimental results show that our model significantly outperforms mainstream baselines, achieving a nearly threefold reduction in APT. By successfully resolving the long-standing tension between speed and stability, our work establishes a new state-of-the-art for intelligent interruption handling in SDS. To facilitate future research, SID-Bench and the associated code are available at: https://github.com/xkx-hub/SID-bench.
CLSep 22, 2025Code
Qwen3-Omni Technical ReportJin Xu, Zhifang Guo, Hangrui Hu et al. · pku
We present Qwen3-Omni, a single multimodal model that, for the first time, maintains state-of-the-art performance across text, image, audio, and video without any degradation relative to single-modal counterparts. Qwen3-Omni matches the performance of same-sized single-modal models within the Qwen series and excels particularly on audio tasks. Across 36 audio and audio-visual benchmarks, Qwen3-Omni achieves open-source SOTA on 32 benchmarks and overall SOTA on 22, outperforming strong closed-source models such as Gemini-2.5-Pro, Seed-ASR, and GPT-4o-Transcribe. Qwen3-Omni adopts a Thinker-Talker MoE architecture that unifies perception and generation across text, images, audio, and video, yielding fluent text and natural real-time speech. It supports text interaction in 119 languages, speech understanding in 19 languages, and speech generation in 10 languages. To reduce first-packet latency in streaming synthesis, Talker autoregressively predicts discrete speech codecs using a multi-codebook scheme. Leveraging the representational capacity of these codebooks, we replace computationally intensive block-wise diffusion with a lightweight causal ConvNet, enabling streaming from the first codec frame. In cold-start settings, Qwen3-Omni achieves a theoretical end-to-end first-packet latency of 234 ms. To further strengthen multimodal reasoning, we introduce a Thinking model that explicitly reasons over inputs from any modality. Since the research community currently lacks a general-purpose audio captioning model, we fine-tuned Qwen3-Omni-30B-A3B to obtain Qwen3-Omni-30B-A3B-Captioner, which produces detailed, low-hallucination captions for arbitrary audio inputs. Qwen3-Omni-30B-A3B, Qwen3-Omni-30B-A3B-Thinking, and Qwen3-Omni-30B-A3B-Captioner are publicly released under the Apache 2.0 license.
SDDec 13, 2024
CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language ModelsZhihao Du, Yuxuan Wang, Qian Chen et al.
In our previous work, we introduced CosyVoice, a multilingual speech synthesis model based on supervised discrete speech tokens. By employing progressive semantic decoding with two popular generative models, language models (LMs) and Flow Matching, CosyVoice demonstrated high prosody naturalness, content consistency, and speaker similarity in speech in-context learning. Recently, significant progress has been made in multi-modal large language models (LLMs), where the response latency and real-time factor of speech synthesis play a crucial role in the interactive experience. Therefore, in this report, we present an improved streaming speech synthesis model, CosyVoice 2, which incorporates comprehensive and systematic optimizations. Specifically, we introduce finite-scalar quantization to improve the codebook utilization of speech tokens. For the text-speech LM, we streamline the model architecture to allow direct use of a pre-trained LLM as the backbone. In addition, we develop a chunk-aware causal flow matching model to support various synthesis scenarios, enabling both streaming and non-streaming synthesis within a single model. By training on a large-scale multilingual dataset, CosyVoice 2 achieves human-parity naturalness, minimal response latency, and virtually lossless synthesis quality in the streaming mode. We invite readers to listen to the demos at https://funaudiollm.github.io/cosyvoice2.
SDMay 18, 2023Code
FunASR: A Fundamental End-to-End Speech Recognition ToolkitZhifu Gao, Zerui Li, Jiaming Wang et al.
This paper introduces FunASR, an open-source speech recognition toolkit designed to bridge the gap between academic research and industrial applications. FunASR offers models trained on large-scale industrial corpora and the ability to deploy them in applications. The toolkit's flagship model, Paraformer, is a non-autoregressive end-to-end speech recognition model that has been trained on a manually annotated Mandarin speech recognition dataset that contains 60,000 hours of speech. To improve the performance of Paraformer, we have added timestamp prediction and hotword customization capabilities to the standard Paraformer backbone. In addition, to facilitate model deployment, we have open-sourced a voice activity detection model based on the Feedforward Sequential Memory Network (FSMN-VAD) and a text post-processing punctuation model based on the controllable time-delay Transformer (CT-Transformer), both of which were trained on industrial corpora. These functional modules provide a solid foundation for building high-precision long audio speech recognition services. Compared to other models trained on open datasets, Paraformer demonstrates superior performance.
SDApr 7, 2021Code
Darts-Conformer: Towards Efficient Gradient-Based Neural Architecture Search For End-to-End ASRXian Shi, Pan Zhou, Wei Chen et al.
Neural architecture search (NAS) has been successfully applied to tasks like image classification and language modeling for finding efficient high-performance network architectures. In ASR field especially end-to-end ASR, the related research is still in its infancy. In this work, we focus on applying NAS on the most popular manually designed model: Conformer, and then propose an efficient ASR model searching method that benefits from the natural advantage of differentiable architecture search (Darts) in reducing computational overheads. We fuse Darts mutator and Conformer blocks to form a complete search space, within which a modified architecture called Darts-Conformer cell is found automatically. The entire searching process on AISHELL-1 dataset costs only 0.7 GPU days. Replacing the Conformer encoder by stacking searched cell, we get an end-to-end ASR model (named as Darts-Conformner) that outperforms the Conformer baseline by 4.7\% on the open-source AISHELL-1 dataset. Besides, we verify the transferability of the architecture searched on a small dataset to a larger 2k-hour dataset. To the best of our knowledge, this is the first successful attempt to apply gradient-based architecture search in the attention-based encoder-decoder ASR model.
CVSep 10, 2020Code
CAD-PU: A Curvature-Adaptive Deep Learning Solution for Point Set UpsamplingJiehong Lin, Xian Shi, Yuan Gao et al.
Point set is arguably the most direct approximation of an object or scene surface, yet its practical acquisition often suffers from the shortcoming of being noisy, sparse, and possibly incomplete, which restricts its use for a high-quality surface recovery. Point set upsampling aims to increase its density and regularity such that a better surface recovery could be achieved. The problem is severely ill-posed and challenging, considering that the upsampling target itself is only an approximation of the underlying surface. Motivated to improve the surface approximation via point set upsampling, we identify the factors that are critical to the objective, by pairing the surface approximation error bounds of the input and output point sets. It suggests that given a fixed budget of points in the upsampling result, more points should be distributed onto the surface regions where local curvatures are relatively high. To implement the motivation, we propose a novel design of Curvature-ADaptive Point set Upsampling network (CAD-PU), the core of which is a module of curvature-adaptive feature expansion. To train CAD-PU, we follow the same motivation and propose geometrically intuitive surrogates that approximate discrete notions of surface curvature for the upsampled point set. We further integrate the proposed surrogates into an adversarial learning based curvature minimization objective, which gives a practically effective learning of CAD-PU. We conduct thorough experiments that show the efficacy of our contributions and the advantages of our method over existing ones. Our implementation codes are publicly available at https://github.com/JiehongLin/CAD-PU.
CLJan 10, 2025
MinMo: A Multimodal Large Language Model for Seamless Voice InteractionQian Chen, Yafeng Chen, Yanni Chen et al.
Recent advancements in large language models (LLMs) and multimodal speech-text models have laid the groundwork for seamless voice interactions, enabling real-time, natural, and human-like conversations. Previous models for voice interactions are categorized as native and aligned. Native models integrate speech and text processing in one framework but struggle with issues like differing sequence lengths and insufficient pre-training. Aligned models maintain text LLM capabilities but are often limited by small datasets and a narrow focus on speech tasks. In this work, we introduce MinMo, a Multimodal Large Language Model with approximately 8B parameters for seamless voice interaction. We address the main limitations of prior aligned multimodal models. We train MinMo through multiple stages of speech-to-text alignment, text-to-speech alignment, speech-to-speech alignment, and duplex interaction alignment, on 1.4 million hours of diverse speech data and a broad range of speech tasks. After the multi-stage training, MinMo achieves state-of-the-art performance across various benchmarks for voice comprehension and generation while maintaining the capabilities of text LLMs, and also facilitates full-duplex conversation, that is, simultaneous two-way communication between the user and the system. Moreover, we propose a novel and simple voice decoder that outperforms prior models in voice generation. The enhanced instruction-following capabilities of MinMo supports controlling speech generation based on user instructions, with various nuances including emotions, dialects, and speaking rates, and mimicking specific voices. For MinMo, the speech-to-text latency is approximately 100ms, full-duplex latency is approximately 600ms in theory and 800ms in practice. The MinMo project web page is https://funaudiollm.github.io/minmo, and the code and models will be released soon.
SDMay 23, 2025
CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-trainingZhihao Du, Changfeng Gao, Yuxuan Wang et al.
In our prior works, we introduced a scalable streaming speech synthesis model, CosyVoice 2, which integrates a large language model (LLM) and a chunk-aware flow matching (FM) model, and achieves low-latency bi-streaming speech synthesis and human-parity quality. Despite these advancements, CosyVoice 2 exhibits limitations in language coverage, domain diversity, data volume, text formats, and post-training techniques. In this paper, we present CosyVoice 3, an improved model designed for zero-shot multilingual speech synthesis in the wild, surpassing its predecessor in content consistency, speaker similarity, and prosody naturalness. Key features of CosyVoice 3 include: 1) A novel speech tokenizer to improve prosody naturalness, developed via supervised multi-task training, including automatic speech recognition, speech emotion recognition, language identification, audio event detection, and speaker analysis. 2) A new differentiable reward model for post-training applicable not only to CosyVoice 3 but also to other LLM-based speech synthesis models. 3) Dataset Size Scaling: Training data is expanded from ten thousand hours to one million hours, encompassing 9 languages and 18 Chinese dialects across various domains and text formats. 4) Model Size Scaling: Model parameters are increased from 0.5 billion to 1.5 billion, resulting in enhanced performance on our multilingual benchmark due to the larger model capacity. These advancements contribute significantly to the progress of speech synthesis in the wild. We encourage readers to listen to the demo at https://funaudiollm.github.io/cosyvoice3.
EPNov 4, 2024
Deep operator neural network applied to efficient computation of asteroid surface temperature and the Yarkovsky effectShunjing Zhao, Hanlun Lei, Xian Shi
Surface temperature distribution is crucial for thermal property-based studies about irregular asteroids in our Solar System. While direct numerical simulations could model surface temperatures with high fidelity, they often take a significant amount of computational time, especially for problems where temperature distributions are required to be repeatedly calculated. To this end, deep operator neural network (DeepONet) provides a powerful tool due to its high computational efficiency and generalization ability. In this work, we applied DeepONet to the modelling of asteroid surface temperatures. Results show that the trained network is able to predict temperature with an accuracy of ~1% on average, while the computational cost is five orders of magnitude lower, hence enabling thermal property analysis in a multidimensional parameter space. As a preliminary application, we analyzed the orbital evolution of asteroids through direct N-body simulations embedded with instantaneous Yarkovsky effect inferred by DeepONet-based thermophysical modelling.Taking asteroids (3200) Phaethon and (89433) 2001 WM41 as examples, we show the efficacy and efficiency of our AI-based approach.
EPMay 20, 2025
ThermoONet -- a deep learning-based small body thermophysical network: applications to modelling water activity of cometsShunjing Zhao, Xian Shi, Hanlun Lei
Cometary activity is a compelling subject of study, with thermophysical models playing a pivotal role in its understanding. However, traditional numerical solutions for small body thermophysical models are computationally intensive, posing challenges for investigations requiring high-resolution or repetitive modeling. To address this limitation, we employed a machine learning approach to develop ThermoONet - a neural network designed to predict the temperature and water ice sublimation flux of comets. Performance evaluations indicate that ThermoONet achieves a low average error in subsurface temperature of approximately 2% relative to the numerical simulation, while reducing computational time by nearly six orders of magnitude. We applied ThermoONet to model the water activity of comets 67P/Churyumov-Gerasimenko and 21P/Giacobini-Zinner. By successfully fitting the water production rate curves of these comets, as obtained by the Rosetta mission and the SOHO telescope, respectively, we demonstrate the network's effectiveness and efficiency. Furthermore, when combined with a global optimization algorithm, ThermoONet proves capable of retrieving the physical properties of target bodies.
SDMay 18, 2023
Accurate and Reliable Confidence Estimation Based on Non-Autoregressive End-to-End Speech Recognition SystemXian Shi, Haoneng Luo, Zhifu Gao et al.
Estimating confidence scores for recognition results is a classic task in ASR field and of vital importance for kinds of downstream tasks and training strategies. Previous end-to-end~(E2E) based confidence estimation models (CEM) predict score sequences of equal length with input transcriptions, leading to unreliable estimation when deletion and insertion errors occur. In this paper we proposed CIF-Aligned confidence estimation model (CA-CEM) to achieve accurate and reliable confidence estimation based on novel non-autoregressive E2E ASR model - Paraformer. CA-CEM utilizes the modeling character of continuous integrate-and-fire (CIF) mechanism to generate token-synchronous acoustic embedding, which solves the estimation failure issue above. We measure the quality of estimation with AUC and RMSE in token level and ECE-U - a proposed metrics in utterance level. CA-CEM gains 24% and 19% relative reduction on ECE-U and also better AUC and RMSE on two test sets. Furthermore, we conduct analysis to explore the potential of CEM for different ASR related usage.
SDFeb 20, 2021
The Accented English Speech Recognition Challenge 2020: Open Datasets, Tracks, Baselines, Results and MethodsXian Shi, Fan Yu, Yizhou Lu et al.
The variety of accents has posed a big challenge to speech recognition. The Accented English Speech Recognition Challenge (AESRC2020) is designed for providing a common testbed and promoting accent-related research. Two tracks are set in the challenge -- English accent recognition (track 1) and accented English speech recognition (track 2). A set of 160 hours of accented English speech collected from 8 countries is released with labels as the training set. Another 20 hours of speech without labels is later released as the test set, including two unseen accents from another two countries used to test the model generalization ability in track 2. We also provide baseline systems for the participants. This paper first reviews the released dataset, track setups, baselines and then summarizes the challenge results and major techniques used in the submissions.
CVJan 18, 2021
Label-Efficient Point Cloud Semantic Segmentation: An Active Learning ApproachXian Shi, Xun Xu, Ke Chen et al.
Deep learning models are the state-of-the-art methods for semantic point cloud segmentation, the success of which relies on the availability of large-scale annotated datasets. However, it can be extremely time-consuming and prohibitively expensive to compile such datasets. In this work, we propose an active learning approach to maximize model performance given limited annotation budgets. We investigate the appropriate sample granularity for active selection under realistic annotation cost measurement (clicks), and demonstrate that super-point based selection allows for more efficient usage of the limited budget compared to point-level and instance-level selection. We further exploit local consistency constraints to boost the performance of the super-point based approach. We evaluate our methods on two benchmarking datasets (ShapeNet and S3DIS) and the results demonstrate that active learning is an effective strategy to address the high annotation costs in semantic point cloud segmentation.
SDNov 17, 2020
Cascade RNN-Transducer: Syllable Based Streaming On-device Mandarin Speech Recognition with a Syllable-to-Character ConverterXiong Wang, Zhuoyuan Yao, Xian Shi et al.
End-to-end models are favored in automatic speech recognition (ASR) because of its simplified system structure and superior performance. Among these models, recurrent neural network transducer (RNN-T) has achieved significant progress in streaming on-device speech recognition because of its high-accuracy and low-latency. RNN-T adopts a prediction network to enhance language information, but its language modeling ability is limited because it still needs paired speech-text data to train. Further strengthening the language modeling ability through extra text data, such as shallow fusion with an external language model, only brings a small performance gain. In view of the fact that Mandarin Chinese is a character-based language and each character is pronounced as a tonal syllable, this paper proposes a novel cascade RNN-T approach to improve the language modeling ability of RNN-T. Our approach firstly uses an RNN-T to transform acoustic feature into syllable sequence, and then converts the syllable sequence into character sequence through an RNN-T-based syllable-to-character converter. Thus a rich text repository can be easily used to strengthen the language model ability. By introducing several important tricks, the cascade RNN-T approach surpasses the character-based RNN-T by a large margin on several Mandarin test sets, with much higher recognition quality and similar latency.
ASJul 12, 2020
The ASRU 2019 Mandarin-English Code-Switching Speech Recognition Challenge: Open Datasets, Tracks, Methods and ResultsXian Shi, Qiangze Feng, Lei Xie
Code-switching (CS) is a common phenomenon and recognizing CS speech is challenging. But CS speech data is scarce and there' s no common testbed in relevant research. This paper describes the design and main outcomes of the ASRU 2019 Mandarin-English code-switching speech recognition challenge, which aims to improve the ASR performance in Mandarin-English code-switching situation. 500 hours Mandarin speech data and 240 hours Mandarin-English intra-sentencial CS data are released to the participants. Three tracks were set for advancing the AM and LM part in traditional DNN-HMM ASR system, as well as exploring the E2E models' performance. The paper then presents an overview of the results and system performance in the three tracks. It turns out that traditional ASR system benefits from pronunciation lexicon, CS text generating and data augmentation. In E2E track, however, the results highlight the importance of using language identification, building-up a rational set of modeling units and spec-augment. The other details in model training and method comparsion are discussed.