85.7 · SD · May 27
Dasheng AudioGen: A Unified Model for Generating Coherent Audio Scenes from Text
Jiahao Mei, Heinrich Dinkel, Yadong Niu et al.
Audio generation has long been fragmented, with speech, music, and sound effects produced by domain-specific models that fail to jointly generate coherent audio scenes from a single description. The key obstacles are insufficient fine-grained supervision for real-world mixed audio and limited acoustic representations for modeling concurrent audio components. We present Dasheng AudioGen, a unified framework for generating general mixed-audio scenes from text. Dasheng AudioGen introduces structured multi-view captions, which explicitly decouple complex acoustic scenes into complementary description views, thereby enabling fine-grained control over audio layers. Furthermore, we employ a high-dimensional unified semantic-acoustic representation as the shared latent space. It injects semantic priors that facilitate cross-modal training convergence, while its high-dimensional feature space provides sufficient capacity to disentangle and fuse concurrent audio components effectively. With these designs, a simple flow-matching DiT achieves high-quality end-to-end audio scene generation. We also establish a comprehensive evaluation pipeline for audio scene generation. Experiments demonstrate that Dasheng AudioGen achieves performance approaching real-world recordings in mixed-audio categories, while remaining competitive with specialized models in single-type generation tasks. Demos are available at https://nieeim.github.io/Dasheng-AudioGen-Web/.
99.6 · SD · Mar 26 · Code
DashengTokenizer: One layer is enough for unified audio understanding and generation
Heinrich Dinkel, Xingwei Sun, Gang Li et al. · apple-ml
This paper introduces DashengTokenizer, a continuous audio tokenizer engineered for joint use in both understanding and generation tasks. Unlike conventional approaches, which train acoustic tokenizers and subsequently integrate frozen semantic knowledge, our method inverts this paradigm: we leverage frozen semantic features and inject acoustic information. In linear evaluation across 22 diverse tasks, our method outperforms previous audio codec and audio encoder baselines by a significant margin while maintaining competitive audio reconstruction quality. Notably, we demonstrate that this acoustic injection improves performance for tasks such as speech emotion recognition, music understanding, and acoustic scene classification. We further evaluate the tokenizer's generative performance on text-to-audio (TTA), text-to-music (TTM), and speech enhancement (SE). Our approach surpasses standard variational autoencoder (VAE)-based methods on TTA and TTM tasks, while its effectiveness on SE underscores its capabilities as a general-purpose audio encoder. Finally, our results challenge the prevailing assumption that VAE-based architectures are a prerequisite for audio synthesis. Checkpoints are available at https://huggingface.co/mispeech/dashengtokenizer.
89.3 · SD · Mar 26 · Code
MiDashengLM: Efficient Audio Understanding with General Audio Captions
Heinrich Dinkel, Gang Li, Jizhong Liu et al.
Current approaches for large audio language models (LALMs) often rely on closed data sources or proprietary models, limiting their generalization and accessibility. This paper introduces MiDashengLM, a novel open audio-language model designed for efficient and comprehensive audio understanding through the use of general audio captions using our novel ACAVCaps training dataset. MiDashengLM exclusively relies on publicly available pretraining and supervised fine-tuning (SFT) datasets, ensuring full transparency and reproducibility. At its core, MiDashengLM integrates Dasheng, an open-source audio encoder, specifically engineered to process diverse auditory information effectively. Unlike previous works primarily focused on Automatic Speech Recognition (ASR) based audio-text alignment, our strategy centers on general audio captions, fusing speech, sound and music information into one textual representation, enabling a holistic textual representation of complex audio scenes. Lastly, MiDashengLM provides an up to 4x speedup in terms of time-to-first-token (TTFT) and up to 20x higher throughput than comparable models. Checkpoints are available online at https://huggingface.co/mispeech/midashenglm-7b and https://github.com/xiaomi-research/dasheng-lm.
89.7 · AS · Mar 25 · Code
ACAVCaps: Enabling large-scale training for fine-grained and diverse audio understanding
Yadong Niu, Tianzi Wang, Heinrich Dinkel et al.
General audio understanding is a fundamental goal for large audio-language models, with audio captioning serving as a cornerstone task for their development. However, progress in this domain is hindered by existing datasets, which lack the scale and descriptive granularity required to train truly versatile models. To address this gap, we introduce ACAVCaps, a new large-scale, fine-grained, and multi-faceted audio captioning dataset. Derived from the ACAV100M collection, ACAVCaps is constructed using a multi-expert pipeline that analyzes audio from diverse perspectives, including speech, music, and acoustic properties, which are then synthesized into rich, detailed descriptions by a large language model. Experimental results demonstrate that models pre-trained on ACAVCaps exhibit substantially stronger generalization capabilities on various downstream tasks compared to those trained on other leading captioning datasets. The dataset is available at https://github.com/xiaomi-research/acavcaps.
82.1 · SD · Mar 24
The Interspeech 2026 Audio Encoder Capability Challenge for Large Audio Language Models
Heinrich Dinkel, Jiahao Zhou, Guanbo Wang et al.
This paper presents the Interspeech 2026 Audio Encoder Capability Challenge, a benchmark specifically designed to evaluate and advance the performance of pre-trained audio encoders as front-end modules for Large Audio Language Models (LALMs). While LALMs have shown remarkable understanding of complex acoustic scenes, their performance depends on the semantic richness of the underlying audio encoder representations. This challenge addresses the integration gap by providing a unified generative evaluation framework, XARES-LLM, which assesses submitted encoders across a diverse suite of downstream classification and generation tasks. By decoupling encoder development from LLM fine-tuning, the challenge establishes a standardized protocol for general-purpose audio representations that can effectively be used for the next generation of multimodal language models.
SD · Mar 14, 2025 · Code
Reinforcement Learning Outperforms Supervised Fine-Tuning: A Case Study on Audio Question Answering
Gang Li, Jizhong Liu, Heinrich Dinkel et al.
Recently, reinforcement learning (RL) has been shown to greatly enhance the reasoning capabilities of large language models (LLMs), and RL-based approaches have been progressively applied to visual multimodal tasks. However, the audio modality has largely been overlooked in these developments. Thus, we conduct a series of RL explorations in audio understanding and reasoning, specifically focusing on the audio question answering (AQA) task. We apply the group relative policy optimization (GRPO) algorithm to Qwen2-Audio-7B-Instruct, and our experiments demonstrate state-of-the-art performance on the MMAU Test-mini benchmark, achieving an accuracy rate of 64.5%. The main findings in this technical report are as follows: 1) The GRPO algorithm can be effectively applied to large audio language models (LALMs), even when the model has only 8.2B parameters; 2) With only 38k post-training samples, RL significantly outperforms supervised fine-tuning (SFT), indicating that RL-based approaches can be effective without large datasets; 3) The explicit reasoning process has not shown significant benefits for AQA tasks, and how to efficiently utilize deep thinking remains an open question for further research; 4) LALMs still lag far behind humans in auditory-language reasoning, suggesting that RL-based approaches warrant further exploration. Our project is available at https://github.com/xiaomi-research/r1-aqa and https://huggingface.co/mispeech/r1-aqa.
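For readers unfamiliar with GRPO, the core idea is that each sampled answer is scored relative to the other answers drawn for the same question, so no separate value model is needed. The following is a minimal sketch of that group-relative advantage step only, not the authors' training code. The function name `group_relative_advantages`, the `eps` constant, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled responses.

    `rewards` holds one scalar reward per response sampled for the
    same prompt. Each advantage is the reward's deviation from the
    group mean, scaled by the group standard deviation.
    NOTE: population std and `eps` are illustrative assumptions.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: four sampled answers to one question, reward 1 if correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct answers receive a positive advantage and incorrect ones a negative advantage; if every answer in the group gets the same reward, all advantages are zero and the group contributes no learning signal.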
SD · Jun 12, 2025 · Code
GLAP: General contrastive audio-text pretraining across domains and languages
Heinrich Dinkel, Zhiyong Yan, Tianzi Wang et al.
Contrastive Language Audio Pretraining (CLAP) is a widely-used method to bridge the gap between audio and text domains. Current CLAP methods enable sound and music retrieval in English, ignoring multilingual spoken content. To address this, we introduce general language audio pretraining (GLAP), which expands CLAP with multilingual and multi-domain abilities. GLAP demonstrates its versatility by achieving competitive performance on standard audio-text retrieval benchmarks like Clotho and AudioCaps, while significantly surpassing existing methods in speech retrieval and classification tasks. Additionally, GLAP achieves strong results on widely used sound-event zero-shot benchmarks, while simultaneously outperforming previous methods on speech content benchmarks. Further keyword spotting evaluations across 50 languages emphasize GLAP's advanced multilingual capabilities. Finally, multilingual sound and music understanding is evaluated across four languages. Checkpoints and Source: https://github.com/xiaomi-research/dasheng-glap.
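As background on contrastive audio-text pretraining, the sketch below shows a common symmetric softmax (InfoNCE-style) objective over a batch of paired audio and text embeddings. The abstract does not state which loss GLAP uses, so this form, the function names, and the temperature value of 0.07 are all assumptions for illustration.

```python
import math


def _log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_contrastive_loss(sim, temperature=0.07):
    """Symmetric contrastive loss over a square similarity matrix.

    sim[i][j] is the similarity between audio clip i and caption j;
    matching pairs sit on the diagonal. The loss averages the
    audio-to-text and text-to-audio cross-entropy terms.
    NOTE: the temperature value is an illustrative assumption.
    """
    n = len(sim)
    logits = [[s / temperature for s in row] for row in sim]
    # Audio-to-text: each row should peak at its diagonal entry.
    a2t = -sum(_log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-audio: each column should peak at its diagonal entry.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2a = -sum(_log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (a2t + t2a)


aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]])
uninformative = symmetric_contrastive_loss([[0.0, 0.0], [0.0, 0.0]])
```

With an uninformative similarity matrix the loss equals the log of the batch size, and it decreases as matching pairs become more similar than mismatched ones.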
AS · Jul 31, 2025 · Code
MECAT: A Multi-Experts Constructed Benchmark for Fine-Grained Audio Understanding Tasks
Yadong Niu, Tianzi Wang, Heinrich Dinkel et al.
While large audio-language models have advanced open-ended audio understanding, they still fall short of nuanced human-level comprehension. This gap persists largely because current benchmarks, limited by data annotations and evaluation metrics, fail to reliably distinguish between generic and highly detailed model outputs. To this end, this work introduces MECAT, a Multi-Expert Constructed Benchmark for Fine-Grained Audio Understanding Tasks. Generated via a pipeline that integrates analysis from specialized expert models with Chain-of-Thought large language model reasoning, MECAT provides multi-perspective, fine-grained captions and open-set question-answering pairs. The benchmark is complemented by a novel metric: DATE (Discriminative-Enhanced Audio Text Evaluation). This metric penalizes generic terms and rewards detailed descriptions by combining single-sample semantic similarity with cross-sample discriminability. A comprehensive evaluation of state-of-the-art audio models is also presented, providing new insights into their current capabilities and limitations. The data and code are available at https://github.com/xiaomi-research/mecat
SD · Jun 11, 2024 · Code
Bridging Language Gaps in Audio-Text Retrieval
Zhiyong Yan, Heinrich Dinkel, Yongqing Wang et al.
Audio-text retrieval is a challenging task, requiring the search for an audio clip or a text caption within a database. The predominant focus of existing research on English descriptions limits the applicability of such models, given the abundance of non-English content in real-world data. To address these linguistic disparities, we propose language enhancement (LE), which uses a multilingual text encoder (SONAR) to encode the text data with language-specific information. Additionally, we optimize the audio encoder through the application of consistent ensemble distillation (CED), enhancing support for variable-length audio-text retrieval. Our methodology excels in English audio-text retrieval, demonstrating state-of-the-art (SOTA) performance on commonly used datasets such as AudioCaps and Clotho. Simultaneously, the approach exhibits proficiency in retrieving content in seven other languages with only 10% of additional language-enhanced training data, yielding promising results. The source code is publicly available at https://github.com/zyyan4/ml-clap.
CV · Jul 13, 2020 · Code
Multiple Sound Sources Localization from Coarse to Fine
Rui Qian, Di Hu, Heinrich Dinkel et al.
Visually localizing multiple sound sources in unconstrained videos is a formidable problem, especially when pairwise sound-object annotations are lacking. To solve this problem, we develop a two-stage audiovisual learning framework that disentangles audio and visual representations of different categories from complex scenes, then performs cross-modal feature alignment in a coarse-to-fine manner. Our model achieves state-of-the-art results on a public localization dataset, as well as considerable performance on multi-source sound localization in complex scenes. We then employ the localization results for sound separation and obtain comparable performance to existing methods. These outcomes demonstrate our model's ability to effectively align sounds with specific visual sources. Code is available at https://github.com/shvdiwnkozbw/Multi-Source-Sound-Localization
SD · Jun 19, 2024
Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio Encoding
Jizhong Liu, Gang Li, Junbo Zhang et al.
Automated audio captioning (AAC) is an audio-to-text task to describe audio contents in natural language. Recently, the advancements in large language models (LLMs), with improvements in training approaches for audio encoders, have opened up possibilities for improving AAC. Thus, we explore enhancing AAC from three aspects: 1) a pre-trained audio encoder via consistent ensemble distillation (CED) is used to improve the effectiveness of acoustic tokens, with a querying transformer (Q-Former) bridging the modality gap to the LLM and compressing the acoustic tokens; 2) we investigate the advantages of using a Llama 2 with 7B parameters as the decoder; 3) another pre-trained LLM corrects text errors caused by insufficient training data and annotation ambiguities. Both the audio encoder and text decoder are optimized by low-rank adaptation (LoRA). Experiments show that each of these enhancements is effective. Our method obtains a 33.0 SPIDEr-FL score, outperforming the winner of DCASE 2023 Task 6A.
SD · May 10, 2021
Voice activity detection in the wild: A data-driven approach using teacher-student training
Heinrich Dinkel, Shuai Wang, Xuenan Xu et al.
Voice activity detection is an essential pre-processing component for speech-related tasks such as automatic speech recognition (ASR). Traditional supervised VAD systems obtain frame-level labels from an ASR pipeline by using, e.g., a Hidden Markov model. These ASR models are commonly trained on clean and fully transcribed data, limiting VAD systems to be trained on clean or synthetically noised datasets. Therefore, a major challenge for supervised VAD systems is their generalization towards noisy, real-world data. This work proposes a data-driven teacher-student approach for VAD, which utilizes vast and unconstrained audio data for training. Unlike previous approaches, only weak labels during teacher training are required, enabling the utilization of any real-world, potentially noisy dataset. Our approach firstly trains a teacher model on a source dataset (Audioset) using clip-level supervision. After training, the teacher provides frame-level guidance to a student model on an unlabeled, target dataset. A multitude of student models trained on mid- to large-sized datasets are investigated (Audioset, Voxceleb, NIST SRE). Our approach is then respectively evaluated on clean, artificially noised, and real-world data. We observe significant performance gains in artificially noised and real-world scenarios. Lastly, we compare our approach against other unsupervised and supervised VAD methods, demonstrating our method's superiority.
SD · Feb 23, 2021
Text-to-Audio Grounding: Building Correspondence Between Captions and Sound Events
Xuenan Xu, Heinrich Dinkel, Mengyue Wu et al.
Automated Audio Captioning is a cross-modal task, generating natural language descriptions to summarize the audio clips' sound events. However, grounding the actual sound events in the given audio based on its corresponding caption has not been investigated. This paper contributes an AudioGrounding dataset, which provides the correspondence between sound events and the captions provided in Audiocaps, along with the location (timestamps) of each present sound event. Based on this dataset, we propose the text-to-audio grounding (TAG) task, which interactively considers the relationship between audio processing and language understanding. A baseline approach is provided, resulting in an event-F1 score of 28.3% and a Polyphonic Sound Detection Score (PSDS) of 14.7%.
SD · Feb 23, 2021
Investigating Local and Global Information for Automated Audio Captioning with Transfer Learning
Xuenan Xu, Heinrich Dinkel, Mengyue Wu et al.
Automated audio captioning (AAC) aims at generating summarizing descriptions for audio clips. Multitudinous concepts are described in an audio caption, ranging from local information such as sound events to global information like acoustic scenery. Currently, the mainstream paradigm for AAC is the end-to-end encoder-decoder architecture, expecting the encoder to learn all levels of concepts embedded in the audio automatically. This paper first proposes a topic model for audio descriptions, comprehensively analyzing the hierarchical audio topics that are commonly covered. We then explore a transfer learning scheme to access local and global information. Two source tasks are identified to respectively represent local and global information, being Audio Tagging (AT) and Acoustic Scene Classification (ASC). Experiments are conducted on the AAC benchmark dataset Clotho and Audiocaps, amounting to a vast increase in all eight metrics with topic transfer learning. Further, it is discovered that local information and abstract representation learning are more crucial to AAC than global information and temporal relationship learning.
SD · Jan 19, 2021
Towards duration robust weakly supervised sound event detection
Heinrich Dinkel, Mengyue Wu, Kai Yu
Sound event detection (SED) is the task of tagging the absence or presence of audio events and their corresponding interval within a given audio clip. While SED can be done using supervised machine learning, where training data is fully labeled with access to per event timestamps and duration, our work focuses on weakly-supervised sound event detection (WSSED), where prior knowledge about an event's duration is unavailable. Recent research within the field focuses on improving segment- and event-level localization performance for specific datasets regarding specific evaluation metrics. Specifically, well-performing event-level localization requires fully labeled development subsets to obtain event duration estimates, which significantly benefits localization performance. Moreover, well-performing segment-level localization models output predictions at a coarse-scale (e.g., 1 second), hindering their deployment on datasets containing very short events (< 1 second). This work proposes a duration robust CRNN (CDur) framework, which aims to achieve competitive performance in terms of segment- and event-level localization. This paper proposes a new post-processing strategy named "Triple Threshold" and investigates two data augmentation methods along with a label smoothing method within the scope of WSSED. Evaluation of our model is done on the DCASE2017 and 2018 Task 4 datasets, and URBAN-SED. Our model outperforms other approaches on the DCASE2018 and URBAN-SED datasets without requiring prior duration knowledge. In particular, our model is capable of similar performance to strongly-labeled supervised models on the URBAN-SED dataset. Lastly, ablation experiments reveal that without post-processing, our model's localization performance drops significantly less than that of other approaches.
AS · Jul 26, 2020
End-to-end spoofing detection with raw waveform CLDNNs
Heinrich Dinkel, Nanxin Chen, Yanmin Qian et al.
Although recent progress in speaker verification has produced powerful models, malicious attacks in the form of spoofed speech are generally not addressed. Recent results in the ASVSpoof2015 and BTAS2016 challenges indicate that spoof-aware features are a possible solution to this problem. Most successful methods in both challenges focus on spoof-aware features, rather than focusing on a powerful classifier. In this paper we present a novel raw waveform based deep model for spoofing detection, which jointly acts as a feature extractor and classifier, thus allowing it to directly classify speech signals. This approach can be considered as an end-to-end classifier, which removes the need for any pre- or post-processing on the data, making training and evaluation a streamlined process, consuming less time than other neural-network based approaches. The experiments on the BTAS2016 dataset show that the system performance is significantly improved by the proposed raw waveform convolutional long short term neural network (CLDNN), from the previous best published 1.26% half total error rate (HTER) to the current 0.82% HTER. Moreover, it shows that the proposed system also performs well under the unknown (RE-PH2-PH3, RE-LPPH2-PH3) conditions.
SD · Mar 27, 2020
Voice activity detection in the wild via weakly supervised sound event detection
Heinrich Dinkel, Yefei Chen, Mengyue Wu et al.
Traditional supervised voice activity detection (VAD) methods work well in clean and controlled scenarios, with performance severely degrading in real-world applications. One possible bottleneck is that speech in the wild contains unpredictable noise types, hence frame-level label prediction is difficult, which is required for traditional supervised VAD training. In contrast, we propose a general-purpose VAD (GPVAD) framework, which can be easily trained from noisy data in a weakly supervised fashion, requiring only clip-level labels. We propose two GPVAD models, one full (GPV-F), trained on 527 Audioset sound events, and one binary (GPV-B), only distinguishing speech and noise. We evaluate the two GPV models against a CRNN based standard VAD model (VAD-C) on three different evaluation protocols (clean, synthetic noise, real data). Results show that our proposed GPV-F demonstrates competitive performance in clean and synthetic scenarios compared to traditional VAD-C. Further, in real-world evaluation, GPV-F largely outperforms VAD-C in terms of frame-level evaluation metrics as well as segment-level ones. With a much lower requirement for frame-labeled data, the naive binary clip-level GPV-B model can still achieve comparable performance to VAD-C in real-world scenarios.
HC · Oct 29, 2019
DEPA: Self-Supervised Audio Embedding for Depression Detection
Pingyue Zhang, Mengyue Wu, Heinrich Dinkel et al.
Depression detection research has increased over the last few decades, yet limited data availability and representation learning remain a major bottleneck. Recently, self-supervised learning has seen success in pretraining text embeddings and has been applied broadly on related tasks with sparse data, while pretrained audio embeddings based on self-supervised learning are rarely investigated. This paper proposes DEPA, a self-supervised, pretrained depression audio embedding method for depression detection. An encoder-decoder network is used to extract DEPA on in-domain depressed datasets (DAIC and MDD) and out-domain (Switchboard, Alzheimer's) datasets. With DEPA as the audio embedding extracted at response-level, a significant performance gain is achieved on downstream tasks, evaluated on both sparse datasets like DAIC and a large major depression disorder dataset (MDD). This paper not only presents a novel embedding extraction method capturing response-level representations for depression detection but, more significantly, is an exploration of self-supervised learning in a specific task within audio processing.
SD · May 31, 2019
Audio Caption in a Car Setting with a Sentence-Level Loss
Xuenan Xu, Heinrich Dinkel, Mengyue Wu et al.
Captioning has attracted much attention in image and video understanding while a small amount of work examines audio captioning. This paper contributes a Mandarin-annotated dataset for audio captioning within a car scene. A sentence-level loss is proposed to be used in tandem with a GRU encoder-decoder model to generate captions with higher semantic similarity to human annotations. We evaluate the model on the newly-proposed Car dataset, a previously published Mandarin Hospital dataset and the Joint dataset, indicating its generalization capability across different scenes. An improvement in all metrics can be observed, including classical natural language generation (NLG) metrics, sentence richness and human evaluation ratings. However, though detailed audio captions can now be automatically generated, human annotations still outperform model captions on many aspects.
SD · Apr 8, 2019
Duration robust weakly supervised sound event detection
Heinrich Dinkel, Kai Yu
Task 4 of the DCASE2018 challenge demonstrated that substantially more research is needed for a real-world application of sound event detection. Analyzing the challenge results it can be seen that most successful models are biased towards predicting long (e.g., over 5s) clips. This work aims to investigate the performance impact of fixed-sized window median filter post-processing and advocate the use of double thresholding as a more robust and predictable post-processing method. Further, four different temporal subsampling methods within the CRNN framework are proposed: mean-max, alpha-mean-max, Lp-norm and convolutional. We show that for this task subsampling the temporal resolution by a neural network enhances the F1 score as well as its robustness towards short, sporadic sound events. Our best single model achieves 30.1% F1 on the evaluation set and the best fusion model 32.5%, while being robust to event length variations.
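To make the double-thresholding idea concrete, here is a minimal sketch of the usual hysteresis form: a contiguous run of frames above a low threshold is kept as an event only if at least one frame in the run also exceeds a high threshold. This is an illustration rather than the paper's implementation; the function name `double_threshold` and the threshold values 0.75 and 0.2 are assumptions.

```python
def double_threshold(probs, hi=0.75, lo=0.2):
    """Binarize per-frame event probabilities with two thresholds.

    A contiguous run of frames with probability above `lo` is marked
    active only if some frame inside it also exceeds `hi`.
    NOTE: the default threshold values are illustrative assumptions.
    """
    n = len(probs)
    active = [False] * n
    i = 0
    while i < n:
        if probs[i] > lo:
            # Find the end of the run of frames above the low threshold.
            j = i
            while j < n and probs[j] > lo:
                j += 1
            # Keep the run only if it contains a confident frame.
            if any(p > hi for p in probs[i:j]):
                for k in range(i, j):
                    active[k] = True
            i = j
        else:
            i += 1
    return active


frames = [0.1, 0.3, 0.8, 0.4, 0.1, 0.3, 0.3, 0.1]
mask = double_threshold(frames)
```

In this example the first run (frames 1 to 3) is kept because it contains a frame above the high threshold, while the second run (frames 5 to 6) is discarded. Unlike a fixed-size median filter, the result does not depend on an assumed event length.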
LG · Apr 8, 2019
Text-based depression detection on sparse data
Heinrich Dinkel, Mengyue Wu, Kai Yu
Previous text-based depression detection is commonly based on large user-generated data. Sparse scenarios like clinical conversations are less investigated. This work proposes a text-based multi-task BGRU network with pretrained word embeddings to model patients' responses during clinical interviews. Our main approach uses a novel multi-task loss function, aiming at modeling both depression severity and binary health state. We independently investigate word- and sentence-level word-embeddings as well as the use of large-data pretraining for depression detection. To strengthen our findings, we report mean-averaged results for a multitude of independent runs on sparse data. First, we show that pretraining is helpful for word-level text-based depression detection. Second, our results demonstrate that sentence-level word-embeddings should be mostly preferred over word-level ones. While the choice of pooling function is less crucial, mean and attention pooling should be preferred over last-timestep pooling. Our method outputs depression presence results as well as predicted severity score, culminating in a macro F1 score of 0.84 and MAE of 3.48 on the DAIC-WOZ development set.
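A multi-task loss of the kind described above can be sketched as the sum of a classification term for the binary health state and a regression term for the severity score. The abstract does not specify the individual terms, so the choice of binary cross-entropy plus a Huber loss, the `weight` parameter, and all function names below are assumptions for illustration only.

```python
import math


def binary_cross_entropy(p, y, eps=1e-7):
    """Cross-entropy between predicted probability p and label y (0 or 1)."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def huber(pred, target, delta=1.0):
    """Huber loss: quadratic for small errors, linear for large ones."""
    err = abs(pred - target)
    if err <= delta:
        return 0.5 * err ** 2
    return delta * (err - 0.5 * delta)


def multitask_loss(p_depressed, y_binary, severity_pred, severity_true,
                   weight=1.0):
    """Combine the binary-state loss and the severity regression loss.

    NOTE: the two loss terms and `weight` are illustrative assumptions,
    not taken from the abstract.
    """
    return (binary_cross_entropy(p_depressed, y_binary)
            + weight * huber(severity_pred, severity_true))


loss = multitask_loss(p_depressed=0.8, y_binary=1,
                      severity_pred=12.0, severity_true=15.0)
```

Sharing one network across both targets lets the scarce clinical data supervise the model twice per interview, once through the binary label and once through the severity score.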
SD · Feb 25, 2019
Audio Caption: Listen and Tell
Mengyue Wu, Heinrich Dinkel, Kai Yu
An increasing amount of research has shed light on machine perception of audio events, most of which concerns detection and classification tasks. However, human-like perception of audio scenes involves not only detecting and classifying audio sounds, but also summarizing the relationship between different audio events. Comparable research such as image captioning has been conducted, yet the audio field is still quite barren. This paper introduces a manually-annotated dataset for audio captioning. The purpose is to automatically generate natural sentences for audio scene description and to bridge the gap between machine perception of audio and image. The whole dataset is labelled in Mandarin and we also include translated English annotations. A baseline encoder-decoder model is provided for both English and Mandarin. Similar BLEU scores are derived for both languages: our model can generate understandable and data-related captions based on the dataset.