Yu-Wen Chen

AS
11 papers
102 citations
Novelty 40%
AI Score 38

11 Papers

NE Dec 11, 2022
BASPRO: a balanced script producer for speech corpus collection based on the genetic algorithm

Yu-Wen Chen, Hsin-Min Wang, Yu Tsao

The performance of speech-processing models is heavily influenced by the speech corpus that is used for training and evaluation. In this study, we propose the BAlanced Script PROducer (BASPRO) system, which can automatically construct a phonetically balanced and rich set of Chinese sentences for collecting Mandarin Chinese speech data. First, we used pretrained natural language processing systems to extract ten-character candidate sentences from a large corpus of Chinese news texts. Then, we applied a genetic algorithm-based method to select 20 phonetically balanced sentence sets, each containing 20 sentences, from the candidate sentences. Using BASPRO, we obtained a recording script called TMNews, which contains 400 ten-character sentences. TMNews covers 84% of the syllables used in the real world. Moreover, its syllable distribution has a cosine similarity of 0.96 to the real-world syllable distribution. We converted the script into a speech corpus using two text-to-speech systems. Using the designed speech corpus, we tested the performance of speech enhancement (SE) and automatic speech recognition (ASR), which are among the most important regression- and classification-based speech processing tasks, respectively. The experimental results show that the SE and ASR models trained on the designed speech corpus outperform their counterparts trained on a randomly composed speech corpus.
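The two statistics quoted in this abstract, syllable coverage and cosine similarity to a reference syllable distribution, can be sketched as follows. This is a minimal illustration on toy data, not the BASPRO implementation: the function names, the toy syllables, and the reference counts are all assumptions, and the fitness function the genetic algorithm actually uses is not stated here.

```python
import math
from collections import Counter


def count_syllables(sentences):
    """Count syllable occurrences; each sentence is a list of syllable strings."""
    return Counter(syl for sentence in sentences for syl in sentence)


def coverage(script_counts, reference_counts):
    """Fraction of reference syllables that appear at least once in the script."""
    covered = sum(1 for syl in reference_counts if script_counts.get(syl, 0) > 0)
    return covered / len(reference_counts)


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors stored as dicts."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy data (hypothetical): a reference distribution and a three-sentence script.
reference = Counter({"ma": 4, "shi": 3, "de": 2, "ni": 1})
script = [["ma", "shi"], ["ma", "de"], ["shi", "ma"]]

script_counts = count_syllables(script)
cov = coverage(script_counts, reference)
sim = cosine_similarity(script_counts, reference)
print(f"coverage={cov:.2f}, cosine similarity={sim:.3f}")
```

A selection procedure such as a genetic algorithm could score candidate sentence sets with a combination of these two quantities, though that pairing is an assumption made for this sketch.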

LG Mar 9, 2022
Investigation of Factorized Optical Flows as Mid-Level Representations

Hsuan-Kung Yang, Tsu-Ching Hsiao, Ting-Hsuan Liao et al.

In this paper, we introduce a new concept of incorporating factorized flow maps as mid-level representations for bridging the perception and control modules in modular learning-based robotic frameworks. To investigate the advantages of factorized flow maps and examine their interplay with other types of mid-level representations, we further develop a configurable framework, along with four different environments that contain both static and dynamic objects, for analyzing the impact of factorized optical flow maps on the performance of deep reinforcement learning agents. Based on this framework, we report our experimental results on various scenarios and offer a set of analyses to justify our hypothesis. Finally, we validate flow factorization in real-world scenarios.

CL Aug 24, 2023
MultiPA: A Multi-task Speech Pronunciation Assessment Model for Open Response Scenarios

Yu-Wen Chen, Zhou Yu, Julia Hirschberg

Pronunciation assessment models designed for open response scenarios enable users to practice language skills in a manner similar to real-life communication. However, previous open-response pronunciation assessment models have predominantly focused on a single pronunciation task, such as sentence-level accuracy, rather than offering a comprehensive assessment of multiple aspects. We propose MultiPA, a Multitask Pronunciation Assessment model that provides sentence-level accuracy, fluency, prosody, and word-level accuracy assessment for open responses. We examined the correlation between different pronunciation tasks and showed the benefits of multi-task learning. Our model reached state-of-the-art performance on existing in-domain datasets and effectively generalized to an out-of-domain dataset that we newly collected. The experimental results demonstrate the practical utility of our model in real-world applications.

25.7 OC Apr 20
Target Mirror Descent: A Unifying Framework for Solving Monotone Variational Inequalities

Yu-Wen Chen, Can Kizilkale, Murat Arcak

It is well known that mirror descent may diverge or cycle on merely monotone variational inequalities. In this paper, we propose Target Mirror Descent (TMD), a unified framework that stabilizes monotone flows via a target point correction mechanism in the dual update. By appropriate design choices, TMD recovers the proximal point algorithm, extragradient methods, splitting methods, Brown-von Neumann-Nash dynamics, forward-backward-forward dynamics, and discounted mirror descent as special cases. Thus, we establish a unified perspective on these landmark algorithms and their convergence. Beyond unification, we leverage the TMD framework to correct an equilibrium misalignment in discounted mirror descent and to generalize its higher-order extension beyond interior solutions. Moreover, a key structural feature of TMD is the explicit decoupling of the mirror map from the target determination, which enables geometric ensembles: multiple algorithms solve the same problem in parallel using distinct mirror maps, while sharing a common dual update. We show that such an ensemble rigorously reduces to a single TMD with a synthesized mirror map, and thus inherits these convergence guarantees.
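The failure mode mentioned in the first sentence of this abstract can be seen on a textbook example. The sketch below is not TMD itself. It uses the bilinear saddle problem min over x, max over y of x*y, whose operator F(x, y) = (y, -x) is monotone but not strongly monotone, and compares a plain gradient step (mirror descent with the Euclidean mirror map) against the extragradient method, one of the algorithms the abstract lists as a special case. The step size, iteration count, and starting point are arbitrary choices for illustration.

```python
import math


def F(z):
    """Monotone operator of the bilinear saddle problem min_x max_y x*y."""
    x, y = z
    return (y, -x)


def gradient_step(z, eta):
    """Plain forward step z - eta * F(z) (Euclidean mirror descent)."""
    g = F(z)
    return (z[0] - eta * g[0], z[1] - eta * g[1])


def extragradient_step(z, eta):
    """Extragradient: evaluate F at a look-ahead point, then step from z."""
    half = gradient_step(z, eta)
    g = F(half)
    return (z[0] - eta * g[0], z[1] - eta * g[1])


def run(step, z0, eta, iters):
    """Iterate `step` and return the distance of the final iterate from the solution (0, 0)."""
    z = z0
    for _ in range(iters):
        z = step(z, eta)
    return math.hypot(z[0], z[1])


z0 = (1.0, 1.0)
gd_norm = run(gradient_step, z0, eta=0.1, iters=2000)
eg_norm = run(extragradient_step, z0, eta=0.1, iters=2000)
print(f"gradient step:  distance to solution = {gd_norm:.3e}")
print(f"extragradient:  distance to solution = {eg_norm:.3e}")
```

On this problem each plain gradient step multiplies the squared distance to the solution by 1 + eta^2, so the iterates spiral outward, while each extragradient step multiplies it by 1 - eta^2 + eta^4, which is below 1 for 0 < eta < 1.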

CL Jun 5, 2024
Exploring Robustness in Doctor-Patient Conversation Summarization: An Analysis of Out-of-Domain SOAP Notes

Yu-Wen Chen, Julia Hirschberg

Summarizing medical conversations poses unique challenges due to the specialized domain and the difficulty of collecting in-domain training data. In this study, we investigate the performance of state-of-the-art generative summarization models for doctor-patient conversations on out-of-domain data. We divide the summarization models into two configurations: (1) a general model that does not distinguish subjective (S), objective (O), assessment (A), and plan (P) notes; (2) a SOAP-oriented model that generates a summary with SOAP sections. We analyzed the limitations and strengths of fine-tuned language model-based methods and GPTs on both configurations. We also conducted a Linguistic Inquiry and Word Count analysis to compare the SOAP notes from different datasets. The results exhibit a strong correlation for reference notes across different datasets, indicating that format mismatch (i.e., discrepancies in word distribution) is not the main cause of performance decline on out-of-domain data. Lastly, a detailed analysis of SOAP notes is included to provide insights into missing information and hallucinations introduced by the models.

AS Sep 3, 2023
Noise robust speech emotion recognition with signal-to-noise ratio adapting speech enhancement

Yu-Wen Chen, Julia Hirschberg, Yu Tsao

Speech emotion recognition (SER) often experiences reduced performance due to background noise. In addition, making a prediction on signals with only background noise could undermine user trust in the system. In this study, we propose a Noise Robust Speech Emotion Recognition system, NRSER. NRSER employs speech enhancement (SE) to effectively reduce the noise in input signals. Then, a signal-to-noise-ratio (SNR)-level detection structure and a waveform reconstitution strategy are introduced to reduce the negative impact of SE on speech signals with little or no background noise. Our experimental results show that NRSER can effectively improve the noise robustness of the SER system, including preventing the system from performing emotion recognition on signals consisting solely of background noise. Moreover, the proposed SNR-level detection structure can be used on its own for tasks such as data selection.

SD Nov 4, 2021
InQSS: a speech intelligibility and quality assessment model using a multi-task learning network

Yu-Wen Chen, Yu Tsao

Speech intelligibility and quality assessment models are essential tools for researchers to evaluate and improve speech processing models. However, only a few studies have investigated multi-task models for intelligibility and quality assessment due to the limitations of available data. In this study, we released TMHINT-QI, the first Chinese speech dataset that records the quality and intelligibility scores of clean, noisy, and enhanced utterances. Then, we propose InQSS, a non-intrusive multi-task learning framework for intelligibility and quality assessment. We evaluated InQSS on both training-from-scratch and pretrained models. The experimental results confirm the effectiveness of the InQSS framework. In addition, the resulting model can predict not only the intelligibility scores but also the quality scores of a speech signal.

AS Apr 7, 2021
The AS-NU System for the M2VoC Challenge

Cheng-Hung Hu, Yi-Chiao Wu, Wen-Chin Huang et al.

This paper describes the AS-NU systems for two tracks in the MultiSpeaker Multi-Style Voice Cloning Challenge (M2VoC). The first track focuses on using a small set of 100 target utterances for voice cloning, while the second track focuses on using only 5 target utterances for voice cloning. Due to the serious lack of data in the second track, we selected the speaker most similar to the target speaker from the training data of the TTS system, and used that speaker's utterances and the given 5 target utterances to fine-tune our model. The evaluation results show that our systems on the two tracks perform similarly in terms of quality, but there is still a clear gap between the similarity score of the second track and that of the first track.

AS Feb 7, 2021
EMA2S: An End-to-End Multimodal Articulatory-to-Speech System

Yu-Wen Chen, Kuo-Hsuan Hung, Shang-Yi Chuang et al.

Synthesized speech from articulatory movements can have real-world use for patients with vocal cord disorders, in situations requiring silent speech, or in high-noise environments. In this work, we present EMA2S, an end-to-end multimodal articulatory-to-speech system that directly converts articulatory movements to speech signals. We use a neural-network-based vocoder combined with multimodal joint-training, incorporating spectrogram, mel-spectrogram, and deep features. The experimental results confirm that the multimodal approach of EMA2S outperforms the baseline system in terms of both objective and subjective evaluation metrics. Moreover, the results demonstrate that joint mel-spectrogram and deep feature loss training can effectively improve system performance.

AO-PH Dec 18, 2020
Investigating Ground-level Ozone Formation: A Case Study in Taiwan

Yu-Wen Chen, Sourav Medya, Yi-Chun Chen

Tropospheric ozone (O3) is a greenhouse gas that can absorb heat and make the weather even hotter during extreme heatwaves. It is also an influential ground-level air pollutant that can severely damage the environment. Thus, evaluating the importance of various factors related to the O3 formation process is essential. However, O3 simulated by the available climate models exhibits large variance in different places, indicating that the models are insufficient to explain the O3 formation process correctly. In this paper, we aim to identify and understand the impact of various factors on O3 formation and predict O3 concentrations under different pollution-reduction and climate change scenarios. We employ six supervised methods to estimate the observed O3 using fourteen meteorological and chemical variables. We find that the deep neural network (DNN) and long short-term memory (LSTM) based models can predict O3 concentrations accurately. We also demonstrate the importance of several variables in this prediction task. The results suggest that while nitrogen oxides contribute negatively to predicting O3, solar radiation makes a significantly positive contribution. Furthermore, we apply our two best models to O3 prediction under different global warming and pollution reduction scenarios to support policy-making decisions on O3 reduction.

AS Aug 21, 2020
CITISEN: A Deep Learning-Based Speech Signal-Processing Mobile Application

Yu-Wen Chen, Kuo-Hsuan Hung, You-Jin Li et al.

This study presents a deep learning-based speech signal-processing mobile application known as CITISEN. CITISEN provides three functions: speech enhancement (SE), model adaptation (MA), and background noise conversion (BNC), allowing CITISEN to be used as a platform for utilizing and evaluating SE models and for flexibly extending the models to address various noise environments and users. For SE, a pretrained SE model downloaded from the cloud server is used to effectively reduce noise components in instant or saved recordings provided by users. When unseen noise or speaker environments are encountered, the MA function is applied to improve CITISEN. A few audio samples recorded in a noisy environment are uploaded and used to adapt the pretrained SE model on the server. Finally, for BNC, CITISEN first removes the background noise through an SE model and then mixes the processed speech with new background noise. The novel BNC function can evaluate SE performance under specific conditions, cover people's tracks, and provide entertainment. The experimental results confirmed the effectiveness of the SE, MA, and BNC functions. Compared with the noisy speech signals, the enhanced speech signals achieved improvements of about 6% and 33% in terms of short-time objective intelligibility (STOI) and perceptual evaluation of speech quality (PESQ), respectively. With MA, the STOI and PESQ could be further improved by approximately 6% and 11%, respectively. Finally, the BNC experiment results indicated that the speech signals converted from noisy and silent backgrounds have similar scene identification accuracy and similar embeddings in an acoustic scene classification model. Therefore, the proposed BNC can effectively convert the background noise of a speech signal and serve as a data augmentation method when clean speech signals are unavailable.
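The second stage of BNC, mixing the enhanced speech with new background noise, can be sketched as below. This is a minimal illustration and not the CITISEN implementation: mixing at a target signal-to-noise ratio is an assumption (the abstract does not say how the mixing level is chosen), the function names are hypothetical, and the synthetic signals stand in for real audio.

```python
import math


def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(v * v for v in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaling the noise so the mix has the given SNR in dB.

    Both inputs are equal-length sequences of float samples.
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    noise_power = power(noise)
    if noise_power == 0:
        raise ValueError("noise signal is silent")
    # Solve 10*log10(P_speech / (scale^2 * P_noise)) = snr_db for scale.
    scale = math.sqrt(power(speech) / (noise_power * 10 ** (snr_db / 10)))
    mixed = [s + scale * n for s, n in zip(speech, noise)]
    return mixed, scale


# Synthetic stand-ins for enhanced speech and a new background noise.
speech = [math.sin(2 * math.pi * 5 * t / 200) for t in range(200)]
noise = [((t * 37) % 11 - 5) / 5 for t in range(200)]

mixed, scale = mix_at_snr(speech, noise, snr_db=5.0)
achieved_snr = 10 * math.log10(power(speech) / power([scale * n for n in noise]))
print(f"achieved SNR = {achieved_snr:.2f} dB")
```

Fixing the SNR makes converted recordings comparable across noise types, which is one plausible reason to mix this way when the output is used for evaluation or data augmentation.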