Chia-Wei Chen

h-index: 8
2 papers

SD, Jan 23, 2025
Bridging The Multi-Modality Gaps of Audio, Visual and Linguistic for Speech Enhancement

Meng-Ping Lin, Jen-Cheng Hou, Chia-Wei Chen et al.

Speech enhancement (SE) aims to improve the quality and intelligibility of speech in noisy environments. Recent studies have shown that incorporating visual cues in audio signal processing can enhance SE performance. Given that human speech communication naturally involves audio, visual, and linguistic modalities, it is reasonable to expect additional improvements by integrating linguistic information. However, effectively bridging these modality gaps, particularly during knowledge transfer, remains a significant challenge. In this paper, we propose a novel multi-modal learning framework, termed DLAV-SE, which leverages a diffusion-based model integrating audio, visual, and linguistic information for audio-visual speech enhancement (AVSE). Within this framework, the linguistic modality is modeled using a pretrained language model (PLM), which transfers linguistic knowledge to the audio-visual domain through a cross-modal knowledge transfer (CMKT) mechanism during training. After training, the PLM is no longer required at inference, as its knowledge is embedded into the AVSE model through the CMKT process. We conduct a series of SE experiments to evaluate the effectiveness of our approach. Results show that the proposed DLAV-SE system significantly improves speech quality and reduces generative artifacts, such as phonetic confusion, compared to state-of-the-art (SOTA) methods. Furthermore, visualization analyses confirm that the CMKT method enhances the generation quality of the AVSE outputs. These findings highlight both the promise of diffusion-based methods for advancing AVSE and the value of incorporating linguistic information to further improve system performance.

IR, Dec 5, 2018
Enriching Article Recommendation with Phrase Awareness

Chia-Wei Chen, Sheng-Chuan Chou, Lun-Wei Ku

Recent deep learning methods for recommendation systems are highly sophisticated. For the article recommendation task, a neural network encoder that generates a latent representation of the article content would prove useful. However, feeding raw text with embeddings into models could degrade sentence meaning and hurt performance. In this paper, we propose PhrecSys (Phrase-based Recommendation System), which injects phrase-level features into content-based recommendation systems to enhance feature informativeness and model interpretability. Experiments conducted on six months of real-world data demonstrate that phrase features boost content-based models in predicting both user click and view behavior. Furthermore, the attention mechanism illustrates that phrase awareness benefits the learning of textual focus by directing the model's attention to meaningful text spans, which leads to interpretable article recommendation.