SD May 1
Fast Text-to-Audio Generation with One-Step Sampling via Energy-Scoring and Auxiliary Contextual Representation Distillation
Kuan-Po Huang, Bo-Ru Lu, Byeonggeun Kim et al.
Autoregressive (AR) models with diffusion heads have recently achieved strong text-to-audio performance, yet their iterative decoding and multi-step sampling process introduce high latency. To address this bottleneck, we propose a one-step sampling framework that combines an energy-distance training objective with representation-level distillation. An energy-scoring head maps Gaussian noise directly to audio latents in one step, eliminating the need for a costly recursive diffusion sampling process, while distillation from a masked autoregressive (MAR) text-to-audio model preserves the strong conditioning learned during diffusion training. On the AudioCaps benchmark, our method consistently outperforms prior one-step baselines such as ConsistencyTTA, SoundCTM, AudioLCM and AudioTurbo, on both objective and subjective metrics, while substantially narrowing the quality gap to AR diffusion systems with multi-step sampling. Compared to the state-of-the-art AR diffusion system, IMPACT, our approach achieves up to 8.5x faster batch inference with highly competitive audio quality. These results demonstrate that combining energy-distance training with representation-level distillation provides an effective recipe for fast, high-quality text-to-audio synthesis.
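The abstract does not spell out the training loss, so the following is only a minimal sketch of the standard empirical energy score that an energy-distance objective is usually built on: a fit term pulling generated samples toward the target, minus half a spread term that rewards diversity among the samples. The function names `dist` and `energy_score`, and the toy one-dimensional "latents", are assumptions made for illustration, not the paper's code.

```python
import math

def dist(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def energy_score(samples, target):
    """Empirical energy score of model samples against one target vector.

    fit    = mean ||x_i - y||           (lower when samples sit near the target)
    spread = mean ||x_i - x_j||, i != j (higher when samples are diverse)
    Returns fit - 0.5 * spread. Needs at least two samples.
    """
    m = len(samples)
    fit = sum(dist(x, target) for x in samples) / m
    spread = sum(
        dist(samples[i], samples[j])
        for i in range(m) for j in range(m) if i != j
    ) / (m * (m - 1))
    return fit - 0.5 * spread

# Hypothetical one-step outputs for two different noise draws.
samples = [[0.0], [2.0]]
near = energy_score(samples, [1.0])  # target lies between the samples
far = energy_score(samples, [5.0])   # target lies far from both samples
```

In a one-step generator, `samples` would be the head's outputs for several Gaussian noise draws under the same conditioning, and the score would be minimized with respect to the head's parameters.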
SD Mar 22, 2022
Federated Self-Supervised Learning for Acoustic Event Classification
Meng Feng, Chieh-Chi Kao, Qingming Tang et al.
Standard acoustic event classification (AEC) solutions require large-scale collection of data from client devices for model optimization. Federated learning (FL) is a compelling framework that decouples data collection and model training to enhance customer privacy. In this work, we investigate the feasibility of applying FL to improve AEC performance when no customer data can be directly uploaded to the server. We assume no pseudo labels can be inferred from on-device user inputs, aligning with the typical use cases of AEC. We adapt self-supervised learning to the FL framework for on-device continual learning of representations, which improves the performance of the downstream AEC classifiers without labeled or pseudo-labeled data. Compared to the baseline without FL, the proposed method improves precision by up to 20.3% relative while maintaining the recall. Our work differs from prior work in FL in that our approach does not require user-generated learning targets, and the data we use is collected from our Beta program and is de-identified, to simulate the production settings as closely as possible.
AS Jul 14, 2025
Generative Audio Language Modeling with Continuous-valued Tokens and Masked Next-Token Prediction
Shu-wen Yang, Byeonggeun Kim, Kuan-Po Huang et al.
Autoregressive next-token prediction with the Transformer decoder has become a de facto standard in large language models (LLMs), achieving remarkable success in Natural Language Processing (NLP) at scale. Extending this paradigm to audio poses unique challenges due to its inherently continuous nature. We study audio generation with a causal language model (LM) without discrete tokens. We leverage token-wise diffusion to model the continuous distribution of the next continuous-valued token. Our approach delivers significant improvements over the previous discrete solution, AudioGen, achieving 20% and 40% relative gains on AudioCaps in Frechet Audio Distance (FAD) and Kullback-Leibler (KL) divergence, respectively. Additionally, we propose a novel masked next-token prediction task that incorporates masked prediction into the causal LM framework. On AudioCaps, this innovation yields 41% and 33% relative FAD improvements over AudioGen Base (285M) and AudioGen Large (1B) models, respectively, and is on par with the state-of-the-art (SOTA) diffusion models. Furthermore, we achieve these results with significantly fewer parameters -- 193M for our Base and 462M for our Large models.
SD Feb 5, 2021
Multi-Task Self-Supervised Pre-Training for Music Classification
Ho-Hsiang Wu, Chieh-Chi Kao, Qingming Tang et al.
Deep learning is very data hungry, and supervised learning in particular requires massive amounts of labeled data to work well. Machine listening research often suffers from a shortage of labeled data, as human annotations are costly to acquire, and annotations for audio are time-consuming and less intuitive. Besides, models learned from a labeled dataset often embed biases specific to that particular dataset. Therefore, unsupervised learning techniques have become popular approaches to solving machine listening problems. In particular, a self-supervised learning technique utilizing reconstructions of multiple hand-crafted audio features has shown promising results when applied to speech-domain tasks such as emotion recognition and automatic speech recognition (ASR). In this paper, we apply self-supervised and multi-task learning methods for pre-training music encoders, and explore various design choices including encoder architectures, weighting mechanisms to combine losses from multiple tasks, and worker selections of pretext tasks. We investigate how these design choices interact with various downstream music classification tasks. We find that using various music-specific workers together with weighting mechanisms to balance the losses during pre-training helps improve performance on, and generalization to, the downstream tasks.
AS Oct 13, 2020
On Front-end Gain Invariant Modeling for Wake Word Spotting
Yixin Gao, Noah D. Stein, Chieh-Chi Kao et al.
Wake word (WW) spotting is challenging in far-field conditions due to the complexities and variations in acoustic conditions and the environmental interference in signal transmission. A suite of carefully designed and optimized audio front-end (AFE) algorithms helps mitigate these challenges and provides better quality audio signals to downstream modules such as the WW spotter. Since the WW model is trained with AFE-processed audio data, its performance is sensitive to AFE variations, such as gain changes. In addition, when deploying to new devices, the WW performance is not guaranteed because the AFE is unknown to the WW model. To address these issues, we propose a novel approach that uses a new feature called ΔLFBE to decouple the AFE gain variations from the WW model. We modify the neural network architectures to accommodate the delta computation, with the feature extraction module unchanged. We evaluate our WW models using data collected from real household settings and show that the models with ΔLFBE are robust to AFE gain changes. Specifically, when the AFE gain changes by up to ±12 dB, the baseline CNN model degrades by up to 19.0% relative in false alarm rate or 34.3% relative in false reject rate, while the model with ΔLFBE shows no performance loss.
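The reason a delta feature can decouple front-end gain is that a multiplicative gain on band energies becomes an additive constant after the logarithm, and any difference between two log values cancels that constant. The sketch below illustrates this under the assumption that the delta is a first-order difference between consecutive frames; the abstract does not give the exact ΔLFBE definition, and the helper names `lfbe`, `delta` and `apply_gain_db` are hypothetical.

```python
import math

def lfbe(energies):
    """Log filter-bank energies: natural log of each band energy, per frame."""
    return [[math.log(e) for e in frame] for frame in energies]

def delta(features):
    """First-order difference between consecutive frames, per band."""
    return [
        [cur_b - prev_b for prev_b, cur_b in zip(prev, cur)]
        for prev, cur in zip(features, features[1:])
    ]

def apply_gain_db(energies, gain_db):
    """Scale band energies by a front-end gain given in dB (power domain)."""
    scale = 10.0 ** (gain_db / 10.0)
    return [[e * scale for e in frame] for frame in energies]

# Toy band energies: 3 frames x 3 bands (made-up values).
frames = [[1.0, 2.0, 4.0], [2.0, 2.5, 3.0], [0.5, 5.0, 1.0]]
louder = apply_gain_db(frames, 12.0)

# Plain LFBE shifts by a constant offset under the gain change ...
lfbe_shift = lfbe(louder)[0][0] - lfbe(frames)[0][0]

# ... while the delta feature is unchanged up to floating-point error.
max_diff = max(
    abs(x - y)
    for row_a, row_b in zip(delta(lfbe(frames)), delta(lfbe(louder)))
    for x, y in zip(row_a, row_b)
)
```

Here `lfbe_shift` equals the log of the gain factor, whereas `max_diff` is numerically zero, which is the invariance the abstract relies on.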
AS Sep 3, 2020
Intra-Utterance Similarity Preserving Knowledge Distillation for Audio Tagging
Chun-Chieh Chang, Chieh-Chi Kao, Ming Sun et al.
Knowledge Distillation (KD) is a popular area of research for reducing the size of large models while still maintaining good performance. The outputs of larger teacher models are used to guide the training of smaller student models. Given the repetitive nature of acoustic events, we propose to leverage this information to regulate the KD training for Audio Tagging. This novel KD method, "Intra-Utterance Similarity Preserving KD" (IUSP), shows promising results for the audio tagging task. It is motivated by the previously published KD method: "Similarity Preserving KD" (SP). However, instead of preserving the pairwise similarities between inputs within a mini-batch, our method preserves the pairwise similarities between the frames of a single input utterance. Our proposed KD method, IUSP, shows consistent improvements over SP across student models of different sizes on the DCASE 2019 Task 5 dataset for audio tagging. IUSP's improvement in micro AUPRC over the baseline is 27.1% to 122.4% larger than SP's improvement over the baseline.
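A minimal sketch of the idea of preserving frame-to-frame similarities within one utterance follows. It assumes the usual similarity-preserving recipe (a Gram matrix of embeddings, L2-normalized per row, compared between teacher and student with a mean squared difference); the abstract does not confirm these details, and `frame_similarity` and `iusp_loss` are hypothetical names.

```python
import math

def frame_similarity(frames):
    """Row-normalized Gram matrix of frame embeddings (T x D -> T x T)."""
    gram = [[sum(a * b for a, b in zip(u, v)) for v in frames] for u in frames]
    normed = []
    for row in gram:
        norm = math.sqrt(sum(x * x for x in row)) or 1.0  # guard all-zero rows
        normed.append([x / norm for x in row])
    return normed

def iusp_loss(teacher_frames, student_frames):
    """Mean squared difference between teacher and student frame similarities.

    Both inputs hold the same number of frames T for one utterance; the
    embedding widths may differ because only T x T matrices are compared.
    """
    g_teacher = frame_similarity(teacher_frames)
    g_student = frame_similarity(student_frames)
    t = len(g_teacher)
    return sum(
        (a - b) ** 2
        for row_t, row_s in zip(g_teacher, g_student)
        for a, b in zip(row_t, row_s)
    ) / (t * t)

# Toy utterance with 3 frames; teacher width 4, student width 2.
teacher = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
aligned = [[2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]    # same frame geometry
collapsed = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]  # all frames identical

loss_aligned = iusp_loss(teacher, aligned)
loss_collapsed = iusp_loss(teacher, collapsed)
```

A student whose frames relate to each other the way the teacher's do incurs essentially zero loss even at a different embedding width, while a student that maps every frame to the same vector is penalized.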
AS Aug 7, 2020
A Joint Framework for Audio Tagging and Weakly Supervised Acoustic Event Detection Using DenseNet with Global Average Pooling
Chieh-Chi Kao, Bowen Shi, Ming Sun et al.
This paper proposes a network architecture mainly designed for audio tagging, which can also be used for weakly supervised acoustic event detection (AED). The proposed network consists of a modified DenseNet as the feature extractor, and a global average pooling (GAP) layer to predict frame-level labels at inference time. This architecture is inspired by the work proposed by Zhou et al., a well-known framework using GAP to localize visual objects given image-level labels. While most of the previous works on weakly supervised AED used recurrent layers with attention-based mechanism to localize acoustic events, the proposed network directly localizes events using the feature map extracted by DenseNet without any recurrent layers. In the audio tagging task of DCASE 2017, our method significantly outperforms the state-of-the-art method in F1 score by 5.3% on the dev set, and 6.0% on the eval set in terms of absolute values. For weakly supervised AED task in DCASE 2018, our model outperforms the state-of-the-art method in event-based F1 by 8.1% on the dev set, and 0.5% on the eval set in terms of absolute values, by using data augmentation and tri-training to leverage unlabeled data.
LG Feb 21, 2020
Few-shot acoustic event detection via meta-learning
Bowen Shi, Ming Sun, Krishna C. Puvvada et al.
We study few-shot acoustic event detection (AED) in this paper. Few-shot learning enables detection of new events with very limited labeled data. Compared to other research areas like computer vision, few-shot learning for audio recognition has been under-studied. We formulate the few-shot AED problem and explore different ways of utilizing traditional supervised methods for this setting as well as a variety of meta-learning approaches, which are conventionally used to solve the few-shot classification problem. Compared to supervised baselines, meta-learning models achieve superior performance, showing their effectiveness in generalizing to new audio events. Our analysis, including the impact of initialization and domain discrepancy, further validates the advantage of meta-learning approaches in few-shot AED.
AS Feb 14, 2020
A Comparison of Pooling Methods on LSTM Models for Rare Acoustic Event Classification
Chieh-Chi Kao, Ming Sun, Weiran Wang et al.
Acoustic event classification (AEC) and acoustic event detection (AED) refer to the task of detecting whether specific target events occur in audio. As long short-term memory (LSTM) leads to state-of-the-art results in various speech-related tasks, it is employed as a popular solution for AEC as well. This paper focuses on investigating the dynamics of LSTM models on AEC tasks. It includes a detailed analysis of LSTM memory retention, and a benchmark of nine different pooling methods on LSTM models using 1.7M generated mixture clips of multiple events with different signal-to-noise ratios. This paper focuses on understanding: 1) utterance-level classification accuracy; 2) sensitivity to event position within an utterance. The analysis is done on the dataset for the detection of rare sound events from the DCASE 2017 Challenge. We find that max pooling on the prediction level performs best among the nine pooling approaches in terms of classification accuracy and insensitivity to event position within an utterance. To the best of the authors' knowledge, this is the first work of its kind focused on LSTM dynamics for AEC tasks.
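Pooling on the prediction level means collapsing per-frame event probabilities into a single utterance-level score. The sketch below compares three illustrative choices on two toy clips that differ only in where the event occurs. The abstract does not list the nine methods, so `max_pool`, `mean_pool` and `last_pool` are assumed examples rather than the paper's exact set, and the probabilities are made up.

```python
def max_pool(frame_probs):
    """Utterance score = highest frame-level probability."""
    return max(frame_probs)

def mean_pool(frame_probs):
    """Utterance score = average frame-level probability."""
    return sum(frame_probs) / len(frame_probs)

def last_pool(frame_probs):
    """Utterance score = probability at the final frame only."""
    return frame_probs[-1]

# Same event, placed early versus late in the clip.
early = [0.9, 0.8, 0.1, 0.1, 0.1, 0.1]
late = [0.1, 0.1, 0.1, 0.1, 0.8, 0.9]

scores = {
    name: (pool(early), pool(late))
    for name, pool in [("max", max_pool), ("mean", mean_pool), ("last", last_pool)]
}
```

Max pooling gives the same score for both clips, which matches the position insensitivity the abstract reports, whereas the last-frame score depends on where the event falls. Mean pooling is also position-independent here but dilutes a short event across the whole clip.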
AS Jul 2, 2019
Sub-band Convolutional Neural Networks for Small-footprint Spoken Term Classification
Chieh-Chi Kao, Ming Sun, Yixin Gao et al.
This paper proposes a Sub-band Convolutional Neural Network for spoken term classification. Convolutional neural networks (CNNs) have proven to be very effective in acoustic applications such as spoken term classification, keyword spotting, speaker identification, acoustic event detection, etc. Unlike applications in computer vision, the spatial invariance property of 2D convolutional kernels does not fit acoustic applications well since the meaning of a specific 2D kernel varies a lot along the feature axis in an input feature map. We propose a sub-band CNN architecture to apply different convolutional kernels on each feature sub-band, which makes the overall computation more efficient. Experimental results show that the computational efficiency brought by sub-band CNN is more beneficial for small-footprint models. Compared to a baseline full band CNN for spoken term classification on a publicly available Speech Commands dataset, the proposed sub-band CNN architecture reduces the computation by 39.7% on commands classification, and 49.3% on digits classification with accuracy maintained.
AS Jul 1, 2019
Compression of Acoustic Event Detection Models With Quantized Distillation
Bowen Shi, Ming Sun, Chieh-Chi Kao et al.
Acoustic Event Detection (AED), which aims to detect categories of events based on audio signals, has found application in many intelligent systems. Recently, deep neural networks have significantly advanced this field and greatly reduced detection errors. However, how to efficiently execute deep models in AED has received much less attention. Meanwhile, state-of-the-art AED models are based on large deep models, which are computationally demanding and challenging to deploy on devices with constrained computational resources. In this paper, we present a simple yet effective compression approach which jointly leverages knowledge distillation and quantization to compress a larger network (teacher model) into a compact network (student model). Experimental results show that the proposed technique not only lowers the error rate of the original compact network by 15% through distillation but also further reduces its model size to a large extent (2% of teacher, 12% of full-precision student) through quantization.
AS May 2, 2019
Compression of Acoustic Event Detection Models with Low-rank Matrix Factorization and Quantization Training
Bowen Shi, Ming Sun, Chieh-Chi Kao et al.
In this paper, we present a compression approach based on the combination of low-rank matrix factorization and quantization training, to reduce complexity for neural network based acoustic event detection (AED) models. Our experimental results show this combined compression approach is very effective. For a three-layer long short-term memory (LSTM) based AED model, the model can be reduced to 1% of its original size with negligible loss of accuracy. Our approach makes it feasible to deploy AED in resource-constrained applications.
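The two techniques compound: factorizing an m x n weight matrix into rank-r factors cuts the number of stored values from m*n to r*(m+n), and quantization then shrinks the bits per value. The sketch below is only the storage arithmetic under assumed numbers. The function name `compressed_fraction`, the 1024 x 1024 shape, the rank of 64 and the 8-bit width are hypothetical and are not the paper's configuration.

```python
def compressed_fraction(rows, cols, rank, bits, orig_bits=32):
    """Storage of a factorized, quantized matrix relative to the original.

    Original:   rows * cols values at orig_bits each.
    Compressed: (rows * rank + rank * cols) values at `bits` each.
    """
    original = rows * cols * orig_bits
    compressed = rank * (rows + cols) * bits
    return compressed / original

# Hypothetical layer: 1024 x 1024 float32 weights.
low_rank_only = compressed_fraction(1024, 1024, rank=64, bits=32)
quantized_only = compressed_fraction(1024, 1024, rank=1024, bits=8)
combined = compressed_fraction(1024, 1024, rank=64, bits=8)
```

With these assumed numbers, factorization alone keeps 12.5% of the storage and the combination keeps about 3.1%. Note that a full-rank factorization doubles the value count, so `quantized_only` comes out at 50% rather than 25%, which shows that factorizing only pays off when the rank is well below the matrix dimensions.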
AS Apr 29, 2019
Semi-supervised Acoustic Event Detection based on tri-training
Bowen Shi, Ming Sun, Chieh-Chi Kao et al.
This paper presents our work on training acoustic event detection (AED) models using an unlabeled dataset. Recent acoustic event detectors are based on large-scale neural networks, which are typically trained with huge amounts of labeled data. Labels for acoustic events are expensive to obtain, and relevant acoustic event audio can be limited, especially for rare events. In this paper we leverage an Internet-scale unlabeled dataset with potential domain shift to improve the detection of acoustic events. Based on the classic tri-training approach, our proposed method shows accuracy improvement over both the supervised training baseline and a semi-supervised self-training setup, in all pre-defined acoustic event detection tasks. As our approach relies on ensemble models, we further show that the improvements can be distilled to a single model via knowledge distillation, with the resulting single student model maintaining the high accuracy of the teacher ensemble models.
SD Aug 20, 2018
A simple model for detection of rare sound events
Weiran Wang, Chieh-chi Kao, Chao Wang
We propose a simple recurrent model for detecting rare sound events, when the time boundaries of events are available for training. Our model optimizes the combination of an utterance-level loss, which classifies whether an event occurs in an utterance, and a frame-level loss, which classifies whether each frame corresponds to the event when it does occur. The two losses make use of a shared vectorial representation of the event, and are connected by an attention mechanism. We demonstrate our model on Task 2 of the DCASE 2017 challenge, and achieve competitive performance.
SD Aug 20, 2018
R-CRNN: Region-based Convolutional Recurrent Neural Network for Audio Event Detection
Chieh-Chi Kao, Weiran Wang, Ming Sun et al.
This paper proposes a Region-based Convolutional Recurrent Neural Network (R-CRNN) for audio event detection (AED). The proposed network is inspired by Faster-RCNN, a well-known region-based convolutional network framework for visual object detection. Different from the original Faster-RCNN, a recurrent layer is added on top of the convolutional network to capture the long-term temporal context from the extracted high-level features. While most previous work on AED generates predictions at frame level first, and then uses post-processing to predict the onset/offset timestamps of events from a probability sequence, the proposed method generates predictions at event level directly and can be trained end-to-end with a multitask loss, which optimizes the classification and localization of audio events simultaneously. The proposed method is tested on the DCASE 2017 Challenge dataset. To the best of our knowledge, R-CRNN is the best performing single-model method among all methods without using ensembles on both the development and evaluation sets. Compared to the other region-based network for AED (R-FCN), which has an event-based error rate (ER) of 0.18 on the development set, our method reduces the ER by half.
CV Jun 3, 2018
Patch-Based Image Hallucination for Super Resolution with Detail Reconstruction from Similar Sample Images
Chieh-Chi Kao, Yuxiang Wang, Jonathan Waltman et al.
Image hallucination and super-resolution have been studied for decades, and many approaches have been proposed to upsample low-resolution images using information from the images themselves, multiple example images, or large image databases. However, most of this work has focused exclusively on small magnification levels because the algorithms simply sharpen the blurry edges in the upsampled images - no actual new detail is typically reconstructed in the final result. In this paper, we present a patch-based algorithm for image hallucination which, for the first time, properly synthesizes novel high frequency detail. To do this, we pose the synthesis problem as a patch-based optimization which inserts coherent, high-frequency detail from contextually-similar images of the same physical scene/subject provided from either a personal image collection or a large online database. The resulting image is visually plausible and contains coherent high frequency information. We demonstrate the robustness of our algorithm by testing it on a large number of images and show that its performance is considerably superior to all state-of-the-art approaches, a result that is verified to be statistically significant through a randomized user study.
CV Jan 16, 2018
Localization-Aware Active Learning for Object Detection
Chieh-Chi Kao, Teng-Yok Lee, Pradeep Sen et al.
Active learning - a class of algorithms that iteratively searches for the most informative samples to include in a training dataset - has been shown to be effective at annotating data for image classification. However, the use of active learning for object detection is still largely unexplored, as determining the informativeness of an object-location hypothesis is more difficult. In this paper, we address this issue and present two metrics for measuring the informativeness of an object hypothesis, which allow us to leverage active learning to reduce the amount of annotated data needed to achieve a target object detection performance. Our first metric measures 'localization tightness' of an object hypothesis, which is based on the overlapping ratio between the region proposal and the final prediction. Our second metric measures 'localization stability' of an object hypothesis, which is based on the variation of predicted object locations when input images are corrupted by noise. Our experimental results show that by augmenting a conventional active-learning algorithm designed for classification with the proposed metrics, the amount of labeled training data required can be reduced by up to 25%. Moreover, on the PASCAL 2007 and 2012 datasets our localization-stability method has an average relative improvement of 96.5% and 81.9% over the baseline method using classification only.
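Both metrics rest on measuring how much two boxes overlap. The sketch below takes the "overlapping ratio" to be intersection over union (IoU), treats tightness as the IoU between a region proposal and the final predicted box, and treats stability as the average IoU between the clean-image box and the boxes predicted under noise. These are plausible readings of the abstract rather than the paper's exact formulas, and `iou`, `tightness` and `stability` are hypothetical names.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def tightness(proposal, prediction):
    """Overlap between the region proposal and the final predicted box."""
    return iou(proposal, prediction)

def stability(reference, noisy_predictions):
    """Average overlap between the clean-image box and boxes under noise."""
    return sum(iou(reference, b) for b in noisy_predictions) / len(noisy_predictions)

# Toy boxes (made-up coordinates).
proposal = (0.0, 0.0, 2.0, 2.0)
prediction = (1.0, 1.0, 3.0, 3.0)
t = tightness(proposal, prediction)
s = stability(prediction, [prediction, (1.0, 1.0, 3.0, 2.0)])
```

Under these assumptions, low tightness or low stability would flag an image as informative for annotation, since the detector's localization there is loose or easily perturbed.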