Chung-Che Wang

SD
h-index13
6papers
24citations
Novelty48%
AI Score33

6 Papers

CVOct 27, 2022
Multimodal Transformer Distillation for Audio-Visual Synchronization

Xuanjun Chen, Haibin Wu, Chung-Che Wang et al.

Audio-visual synchronization aims to determine whether the mouth movements and speech in the video are synchronized. VocaLiST reaches state-of-the-art performance by incorporating multimodal Transformers to model audio-visual interact information. However, it requires high computing resources, making it impractical for real-world applications. This paper proposed an MTDVocaLiST model, which is trained by our proposed multimodal Transformer distillation (MTD) loss. MTD loss enables MTDVocaLiST model to deeply mimic the cross-attention distribution and value-relation in the Transformer of VocaLiST. Additionally, we harness uncertainty weighting to fully exploit the interaction information across all layers. Our proposed method is effective in two aspects: From the distillation method perspective, MTD loss outperforms other strong distillation baselines. From the distilled model's performance perspective: 1) MTDVocaLiST outperforms similar-size SOTA models, SyncNet, and Perfect Match models by 15.65% and 3.35%; 2) MTDVocaLiST reduces the model size of VocaLiST by 83.52%, yet still maintaining similar performance.

SDFeb 16, 2023
Personalized Audio Quality Preference Prediction

Chung-Che Wang, Yu-Chun Lin, Yu-Teng Hsu et al.

This paper proposes to use both audio input and subject information to predict the personalized preference of two audio segments with the same content in different qualities. A siamese network is used to compare the inputs and predict the preference. Several different structures for each side of the siamese network are investigated, and an LDNet with PANNs' CNN6 as the encoder and a multi-layer perceptron block as the decoder outperforms a baseline model using only audio input the most, where the overall accuracy grows from 77.56% to 78.04%. Experimental results also show that using all the subject information, including age, gender, and the specifications of headphones or earphones, is more effective than using only a part of them.

SDAug 4, 2025
Localizing Audio-Visual Deepfakes via Hierarchical Boundary Modeling

Xuanjun Chen, Shih-Peng Cheng, Jiawei Du et al.

Audio-visual temporal deepfake localization under the content-driven partial manipulation remains a highly challenging task. In this scenario, the deepfake regions are usually only spanning a few frames, with the majority of the rest remaining identical to the original. To tackle this, we propose a Hierarchical Boundary Modeling Network (HBMNet), which includes three modules: an Audio-Visual Feature Encoder that extracts discriminative frame-level representations, a Coarse Proposal Generator that predicts candidate boundary regions, and a Fine-grained Probabilities Generator that refines these proposals using bidirectional boundary-content probabilities. From the modality perspective, we enhance audio-visual learning through dedicated encoding and fusion, reinforced by frame-level supervision to boost discriminability. From the temporal perspective, HBMNet integrates multi-scale cues and bidirectional boundary-content relationships. Experiments show that encoding and fusion primarily improve precision, while frame-level supervision boosts recall. Each module (audio-visual fusion, temporal scales, bi-directionality) contributes complementary benefits, collectively enhancing localization performance. HBMNet outperforms BA-TFD and UMMAFormer and shows improved potential scalability with more training data.

LGDec 30, 2024
Improving Location-based Thermal Emission Side-Channel Analysis Using Iterative Transfer Learning

Tun-Chieh Lou, Chung-Che Wang, Jyh-Shing Roger Jang et al.

This paper proposes the use of iterative transfer learning applied to deep learning models for side-channel attacks. Currently, most of the side-channel attack methods train a model for each individual byte, without considering the correlation between bytes. However, since the models' parameters for attacking different bytes may be similar, we can leverage transfer learning, meaning that we first train the model for one of the key bytes, then use the trained model as a pretrained model for the remaining bytes. This technique can be applied iteratively, a process known as iterative transfer learning. Experimental results show that when using thermal or power consumption map images as input, and multilayer perceptron or convolutional neural network as the model, our method improves average performance, especially when the amount of data is insufficient.

SDJun 30, 2024
Improving Real-Time Music Accompaniment Separation with MMDenseNet

Chun-Hsiang Wang, Chung-Che Wang, Jun-You Wang et al.

Music source separation aims to separate polyphonic music into different types of sources. Most existing methods focus on enhancing the quality of separated results by using a larger model structure, rendering them unsuitable for deployment on edge devices. Moreover, these methods may produce low-quality output when the input duration is short, making them impractical for real-time applications. Therefore, the goal of this paper is to enhance a lightweight model, MMDenstNet, to strike a balance between separation quality and latency for real-time applications. Different directions of improvement are explored or proposed in this paper, including complex ideal ratio mask, self-attention, band-merge-split method, and feature look back. Source-to-distortion ratio, real-time factor, and optimal latency are employed to evaluate the performance. To align with our application requirements, the evaluation process in this paper focuses on the separation performance of the accompaniment part. Experimental results demonstrate that our improvement achieves low real-time factor and optimal latency while maintaining acceptable separation quality.

SDMar 31, 2022
Adversarial Speaker Distillation for Countermeasure Model on Automatic Speaker Verification

Yen-Lun Liao, Xuanjun Chen, Chung-Che Wang et al.

The countermeasure (CM) model is developed to protect ASV systems from spoof attacks and prevent resulting personal information leakage in Automatic Speaker Verification (ASV) system. Based on practicality and security considerations, the CM model is usually deployed on edge devices, which have more limited computing resources and storage space than cloud-based systems, confining the model size under a limitation. To better trade off the CM model sizes and performance, we proposed an adversarial speaker distillation method, which is an improved version of knowledge distillation method combined with generalized end-to-end (GE2E) pre-training and adversarial fine-tuning. In the evaluation phase of the ASVspoof 2021 Logical Access task, our proposed adversarial speaker distillation ResNetSE (ASD-ResNetSE) model reaches 0.2695 min t-DCF and 3.54% EER. ASD-ResNetSE only used 22.5% of parameters and 19.4% of multiply and accumulate operands of ResNetSE model.