SDMay 28, 2025
AudioTurbo: Fast Text-to-Audio Generation with Rectified DiffusionJunqi Zhao, Jinzheng Zhao, Haohe Liu et al.
Diffusion models have significantly improved the quality and diversity of audio generation but are hindered by slow inference speed. Rectified flow enhances inference speed by learning straight-line ordinary differential equation (ODE) paths. However, this approach requires training a flow-matching model from scratch and tends to perform suboptimally, or even poorly, at low step counts. To address the limitations of rectified flow while leveraging the advantages of advanced pre-trained diffusion models, this study integrates pre-trained models with the rectified diffusion method to improve the efficiency of text-to-audio (TTA) generation. Specifically, we propose AudioTurbo, which learns first-order ODE paths from deterministic noise sample pairs generated by a pre-trained TTA model. Experiments on the AudioCaps dataset demonstrate that our model, with only 10 sampling steps, outperforms prior models and reduces inference to 3 steps compared to a flow-matching-based acceleration model.
SDApr 2, 2021
An Audio-Based Deep Learning Framework For BBC Television Programme ClassificationLam Pham, Chris Baume, Qiuqiang Kong et al.
This paper proposes a deep learning framework for classification of BBC television programmes using audio. The audio is firstly transformed into spectrograms, which are fed into a pre-trained convolutional Neural Network (CNN), obtaining predicted probabilities of sound events occurring in the audio recording. Statistics for the predicted probabilities and detected sound events are then calculated to extract discriminative features representing the television programmes. Finally, the embedded features extracted are fed into a classifier for classifying the programmes into different genres. Our experiments are conducted over a dataset of 6,160 programmes belonging to nine genres labelled by the BBC. We achieve an average classification accuracy of 93.7% over 14-fold cross validation. This demonstrates the efficacy of the proposed framework for the task of audio-based classification of television programmes.
LGJul 3, 2019
Deep Learning Based Energy Disaggregation and On/Off Detection of Household AppliancesJie Jiang, Qiuqiang Kong, Mark Plumbley et al.
Energy disaggregation, a.k.a. Non-Intrusive Load Monitoring, aims to separate the energy consumption of individual appliances from the readings of a mains power meter measuring the total energy consumption of, e.g. a whole house. Energy consumption of individual appliances can be useful in many applications, e.g., providing appliance-level feedback to the end users to help them understand their energy consumption and ultimately save energy. Recently, with the availability of large-scale energy consumption datasets, various neural network models such as convolutional neural networks and recurrent neural networks have been investigated to solve the energy disaggregation problem. Neural network models can learn complex patterns from large amounts of data and have been shown to outperform the traditional machine learning methods such as variants of hidden Markov models. However, current neural network methods for energy disaggregation are either computational expensive or are not capable of handling long-term dependencies. In this paper, we investigate the application of the recently developed WaveNet models for the task of energy disaggregation. Based on a real-world energy dataset collected from 20 households over two years, we show that WaveNet models outperforms the state-of-the-art deep learning methods proposed in the literature for energy disaggregation in terms of both error measures and computational cost. On the basis of energy disaggregation, we then investigate the performance of two deep-learning based frameworks for the task of on/off detection which aims at estimating whether an appliance is in operation or not. Based on the same dataset, we show that for the task of on/off detection the second framework, i.e., directly training a binary classifier, achieves better performance in terms of F1 score.
SDOct 6, 2016
A Joint Detection-Classification Model for Audio Tagging of Weakly Labelled DataQiuqiang Kong, Yong Xu, Wenwu Wang et al.
Audio tagging aims to assign one or several tags to an audio clip. Most of the datasets are weakly labelled, which means only the tags of the clip are known, without knowing the occurrence time of the tags. The labeling of an audio clip is often based on the audio events in the clip and no event level label is provided to the user. Previous works have used the bag of frames model assume the tags occur all the time, which is not the case in practice. We propose a joint detection-classification (JDC) model to detect and classify the audio clip simultaneously. The JDC model has the ability to attend to informative and ignore uninformative sounds. Then only informative regions are used for classification. Experimental results on the "CHiME Home" dataset show that the JDC model reduces the equal error rate (EER) from 19.0% to 16.9%. More interestingly, the audio event detector is trained successfully without needing the event level label.
MSOct 1, 2012
Best Practices for Scientific ComputingGreg Wilson, D. A. Aruliah, C. Titus Brown et al.
Scientists spend an increasing amount of time building and using software. However, most scientists are never taught how to do this efficiently. As a result, many are unaware of tools and practices that would allow them to write more reliable and maintainable code with less effort. We describe a set of best practices for scientific software development that have solid foundations in research and experience, and that improve scientists' productivity and the reliability of their software.