SDSep 19, 2023
ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency DistillationYatong Bai, Trung Dang, Dung Tran et al. · microsoft-research
Diffusion models are instrumental in text-to-audio (TTA) generation. Unfortunately, they suffer from slow inference due to an excessive number of queries to the underlying denoising network per generation. To address this bottleneck, we introduce ConsistencyTTA, a framework requiring only a single non-autoregressive network query, thereby accelerating TTA by hundreds of times. We achieve so by proposing "CFG-aware latent consistency model," which adapts consistency generation into a latent space and incorporates classifier-free guidance (CFG) into model training. Moreover, unlike diffusion models, ConsistencyTTA can be finetuned closed-loop with audio-space text-aware metrics, such as CLAP score, to further enhance the generations. Our objective and subjective evaluation on the AudioCaps dataset shows that compared to diffusion-based counterparts, ConsistencyTTA reduces inference computation by 400x while retaining generation quality and diversity.
SDMar 14, 2024Code
uaMix-MAE: Efficient Tuning of Pretrained Audio Transformers with Unsupervised Audio MixturesAfrina Tabassum, Dung Tran, Trung Dang et al.
Masked Autoencoders (MAEs) learn rich low-level representations from unlabeled data but require substantial labeled data to effectively adapt to downstream tasks. Conversely, Instance Discrimination (ID) emphasizes high-level semantics, offering a potential solution to alleviate annotation requirements in MAEs. Although combining these two approaches can address downstream tasks with limited labeled data, naively integrating ID into MAEs leads to extended training times and high computational costs. To address this challenge, we introduce uaMix-MAE, an efficient ID tuning strategy that leverages unsupervised audio mixtures. Utilizing contrastive tuning, uaMix-MAE aligns the representations of pretrained MAEs, thereby facilitating effective adaptation to task-specific semantics. To optimize the model with small amounts of unlabeled data, we propose an audio mixing technique that manipulates audio samples in both input and virtual label spaces. Experiments in low/few-shot settings demonstrate that \modelname achieves 4-6% accuracy improvements over various benchmarks when tuned with limited unlabeled data, such as AudioSet-20K. Code is available at https://github.com/PLAN-Lab/uamix-MAE
CVFeb 13, 2024
Learned Image Compression with Text Quality EnhancementChih-Yu Lai, Dung Tran, Kazuhito Koishida
Learned image compression has gained widespread popularity for their efficiency in achieving ultra-low bit-rates. Yet, images containing substantial textual content, particularly screen-content images (SCI), often suffers from text distortion at such compressed levels. To address this, we propose to minimize a novel text logit loss designed to quantify the disparity in text between the original and reconstructed images, thereby improving the perceptual quality of the reconstructed text. Through rigorous experimentation across diverse datasets and employing state-of-the-art algorithms, our findings reveal significant enhancements in the quality of reconstructed text upon integration of the proposed loss function with appropriate weighting. Notably, we achieve a Bjontegaard delta (BD) rate of -32.64% for Character Error Rate (CER) and -28.03% for Word Error Rate (WER) on average by applying the text logit loss for two screenshot datasets. Additionally, we present quantitative metrics tailored for evaluating text quality in image compression tasks. Our findings underscore the efficacy and potential applicability of our proposed text logit loss function across various text-aware image compression contexts.
CVJun 1, 2025
MOOSE: Pay Attention to Temporal Dynamics for Video Understanding via Optical FlowsHong Nguyen, Dung Tran, Hieu Hoang et al.
Many motion-centric video analysis tasks, such as atomic actions, detecting atypical motor behavior in individuals with autism, or analyzing articulatory motion in real-time MRI of human speech, require efficient and interpretable temporal modeling. Capturing temporal dynamics is a central challenge in video analysis, often requiring significant computational resources and fine-grained annotations that are not widely available. This paper presents MOOSE (Motion Flow Over Spatial Space), a novel temporally-centric video encoder explicitly integrating optical flow with spatial embeddings to model temporal information efficiently, inspired by human perception of motion. Unlike prior models, MOOSE takes advantage of rich, widely available pre-trained visual and optical flow encoders instead of training video models from scratch. This significantly reduces computational complexity while enhancing temporal interpretability. Our primary contributions includes (1) proposing a computationally efficient temporally-centric architecture for video understanding (2) demonstrating enhanced interpretability in modeling temporal dynamics; and (3) achieving state-of-the-art performance on diverse benchmarks, including clinical, medical, and standard action recognition datasets, confirming the broad applicability and effectiveness of our approach.
CVJan 5, 2022
Corrupting Data to Remove Deceptive Perturbation: Using Preprocessing Method to Improve System RobustnessHieu Le, Hans Walker, Dung Tran et al.
Although deep neural networks have achieved great performance on classification tasks, recent studies showed that well trained networks can be fooled by adding subtle noises. This paper introduces a new approach to improve neural network robustness by applying the recovery process on top of the naturally trained classifier. In this approach, images will be intentionally corrupted by some significant operator and then be recovered before passing through the classifiers. SARGAN -- an extension on Generative Adversarial Networks (GAN) is capable of denoising radar signals. This paper will show that SARGAN can also recover corrupted images by removing the adversarial effects. Our results show that this approach does improve the performance of naturally trained networks.
ASDec 21, 2021
Augmented Contrastive Self-Supervised Learning for Audio Invariant RepresentationsMelikasadat Emami, Dung Tran, Kazuhito Koishida
Improving generalization is a major challenge in audio classification due to labeled data scarcity. Self-supervised learning (SSL) methods tackle this by leveraging unlabeled data to learn useful features for downstream classification tasks. In this work, we propose an augmented contrastive SSL framework to learn invariant representations from unlabeled data. Our method applies various perturbations to the unlabeled input data and utilizes contrastive learning to learn representations robust to such perturbations. Experimental results on the Audioset and DESED datasets show that our framework significantly outperforms state-of-the-art SSL and supervised learning methods on sound/event classification tasks.
SDDec 8, 2021
Training Robust Zero-Shot Voice Conversion Models with Self-supervised FeaturesTrung Dang, Dung Tran, Peter Chin et al.
Unsupervised Zero-Shot Voice Conversion (VC) aims to modify the speaker characteristic of an utterance to match an unseen target speaker without relying on parallel training data. Recently, self-supervised learning of speech representation has been shown to produce useful linguistic units without using transcripts, which can be directly passed to a VC model. In this paper, we showed that high-quality audio samples can be achieved by using a length resampling decoder, which enables the VC model to work in conjunction with different linguistic feature extractors and vocoders without requiring them to operate on the same sequence length. We showed that our method can outperform many baselines on the VCTK dataset. Without modifying the architecture, we further demonstrated that a) using pairs of different audio segments from the same speaker, b) adding a cycle consistency loss, and c) adding a speaker classification loss can help to learn a better speaker embedding. Our model trained on LibriTTS using these techniques achieves the best performance, producing audio samples transferred well to the target speaker's voice, while preserving the linguistic content that is comparable with actual human utterances in terms of Character Error Rate.
ASApr 19, 2019
An Investigation of End-to-End Multichannel Speech Recognition for Reverberant and Mismatch ConditionsAswin Shanmugam Subramanian, Xiaofei Wang, Shinji Watanabe et al.
Sequence-to-sequence (S2S) modeling is becoming a popular paradigm for automatic speech recognition (ASR) because of its ability to jointly optimize all the conventional ASR components in an end-to-end (E2E) fashion. This report investigates the ability of E2E ASR from standard close-talk to far-field applications by encompassing entire multichannel speech enhancement and ASR components within the S2S model. There have been previous studies on jointly optimizing neural beamforming alongside E2E ASR for denoising. It is clear from both recent challenge outcomes and successful products that far-field systems would be incomplete without solving both denoising and dereverberation simultaneously. This report uses a recently developed architecture for far-field ASR by composing neural extensions of dereverberation and beamforming modules with the S2S ASR module as a single differentiable neural network and also clearly defining the role of each subnetwork. The original implementation of this architecture was successfully applied to the noisy speech recognition task (CHiME-4), while we applied this implementation to noisy reverberant tasks (DIRHA and REVERB). Our investigation shows that the method achieves better performance than conventional pipeline methods on the DIRHA English dataset and comparable performance on the REVERB dataset. It also has additional advantages of being neither iterative nor requiring parallel noisy and clean speech data.
ASOct 25, 2018
Speaker Selective Beamformer with Keyword Mask EstimationYusuke Kida, Dung Tran, Motoi Omachi et al.
This paper addresses the problem of automatic speech recognition (ASR) of a target speaker in background speech. The novelty of our approach is that we focus on a wakeup keyword, which is usually used for activating ASR systems like smart speakers. The proposed method firstly utilizes a DNN-based mask estimator to separate the mixture signal into the keyword signal uttered by the target speaker and the remaining background speech. Then the separated signals are used for calculating a beamforming filter to enhance the subsequent utterances from the target speaker. Experimental evaluations show that the trained DNN-based mask can selectively separate the keyword and background speech from the mixture signal. The effectiveness of the proposed method is also verified with Japanese ASR experiments, and we confirm that the character error rates are significantly improved by the proposed method for both simulated and real recorded test sets.