Attention Is Not Always the Answer: Optimizing Voice Activity Detection with Simple Feature Fusion
This work addresses voice activity detection (VAD) for speech processing, showing that fusing MFCCs with pre-trained model features improves on single-feature models.
This study proposes FusionVAD, a Voice Activity Detection (VAD) framework that combines MFCCs with pre-trained model features. Simple fusion strategies such as addition outperformed cross-attention, and the best fusion model achieved a 2.04% absolute average improvement over the state-of-the-art Pyannote across multiple datasets.
Voice Activity Detection (VAD) plays a key role in speech processing, often utilizing hand-crafted or neural features. This study examines the effectiveness of Mel-Frequency Cepstral Coefficients (MFCCs) and pre-trained model (PTM) features, including wav2vec 2.0, HuBERT, WavLM, UniSpeech, MMS, and Whisper. We propose FusionVAD, a unified framework that combines both feature types using three fusion strategies: concatenation, addition, and cross-attention (CA). Experimental results reveal that simple fusion techniques, particularly addition, outperform CA in both accuracy and efficiency. Fusion-based models consistently surpass single-feature models, highlighting the complementary nature of MFCCs and PTM features. Notably, our best-performing fusion model exceeds the state-of-the-art Pyannote across multiple datasets, achieving an absolute average improvement of 2.04%. These results confirm that simple feature fusion enhances VAD robustness while maintaining computational efficiency.
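To make the three fusion strategies concrete, the following is a minimal NumPy sketch, not the FusionVAD implementation. It assumes the MFCC and PTM features are already frame-aligned and projected to a shared dimension. All sizes (`T`, `D_MFCC`, `D_PTM`, `D`), the random projection matrices standing in for learned linear layers, and the function names are hypothetical. The cross-attention variant is a single-head simplification without learned query/key/value projections.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
T = 50        # number of frames
D_MFCC = 40   # MFCC feature dimension
D_PTM = 768   # pre-trained model (PTM) feature dimension
D = 128       # shared dimension used for fusion

# Random stand-ins for frame-aligned MFCC and PTM features.
mfcc = rng.standard_normal((T, D_MFCC))
ptm = rng.standard_normal((T, D_PTM))

# Random matrices stand in for learned linear projections (assumption).
W_mfcc = rng.standard_normal((D_MFCC, D)) / np.sqrt(D_MFCC)
W_ptm = rng.standard_normal((D_PTM, D)) / np.sqrt(D_PTM)

h_mfcc = mfcc @ W_mfcc  # shape (T, D)
h_ptm = ptm @ W_ptm     # shape (T, D)


def fuse_concat(a, b):
    """Concatenation: stack both feature vectors per frame, shape (T, 2D)."""
    return np.concatenate([a, b], axis=-1)


def fuse_add(a, b):
    """Addition: element-wise sum per frame, shape (T, D)."""
    return a + b


def fuse_cross_attention(query, context):
    """Single-head scaled dot-product cross-attention, shape (T, D).

    Each query frame attends over all context frames. Learned
    query/key/value projections are omitted to keep the sketch short.
    """
    d = query.shape[-1]
    scores = query @ context.T / np.sqrt(d)              # (T, T)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ context


fused_concat = fuse_concat(h_mfcc, h_ptm)
fused_add = fuse_add(h_mfcc, h_ptm)
fused_ca = fuse_cross_attention(h_mfcc, h_ptm)
```

In this sketch, addition and concatenation cost a single pass over the frames, while cross-attention builds a T-by-T score matrix, which is one plausible reason the simpler strategies are cheaper to run.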