CL · Aug 25, 2025

Evaluating the Representation of Vowels in Wav2Vec Feature Extractor: A Layer-Wise Analysis Using MFCCs

arXiv:2508.17914v1
h-index: 3
Originality: Synthesis-oriented
AI Analysis

This is an incremental analysis for speech recognition researchers, focusing on evaluating phonetic representation in a specific model component.

This study examines how vowels are represented in the CNN feature extractor of Wav2Vec by comparing its activations with MFCC-based features. On the TIMIT corpus, CNN activations achieved higher classification accuracy for front-back vowel identification.

Automatic Speech Recognition has advanced with self-supervised learning, enabling feature extraction directly from raw audio. In Wav2Vec, a CNN first transforms audio into feature vectors before the transformer processes them. This study examines CNN-extracted information for monophthong vowels using the TIMIT corpus. We compare MFCCs, MFCCs with formants, and CNN activations by training SVM classifiers for front-back vowel identification, assessing their classification accuracy to evaluate phonetic representation.
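The comparison described above (train one classifier per feature set, then compare held-out accuracy) can be sketched on toy data. This is a minimal illustration and not the authors' pipeline: the tokens are synthetic (F1, F2) formant pairs rather than TIMIT vowels, the two feature sets ("F1 only" and "F1 + F2") are hypothetical stand-ins for MFCCs and CNN activations, and the classifier is a hand-written linear SVM trained by hinge-loss subgradient descent, since the abstract does not state the kernel or hyperparameters. All function names are assumptions made for this sketch.

```python
import random

def make_toy_tokens(n_per_class, seed=0):
    """Synthetic vowel tokens as ([F1, F2] in Hz, label); illustrative values only."""
    rng = random.Random(seed)
    tokens = []
    for _ in range(n_per_class):
        # Front vowels have a higher second formant (F2) than back vowels.
        tokens.append(([rng.gauss(450, 120), rng.gauss(2100, 200)], 1))   # front
        tokens.append(([rng.gauss(450, 120), rng.gauss(1000, 200)], -1))  # back
    rng.shuffle(tokens)
    return tokens

def standardize(train_x, test_x):
    """Z-score both splits using statistics from the training split only."""
    dims = len(train_x[0])
    n = len(train_x)
    means = [sum(x[d] for x in train_x) / n for d in range(dims)]
    stds = [(sum((x[d] - means[d]) ** 2 for x in train_x) / n) ** 0.5 or 1.0
            for d in range(dims)]

    def scale(xs):
        return [[(x[d] - means[d]) / stds[d] for d in range(dims)] for x in xs]

    return scale(train_x), scale(test_x)

def decision(w, b, x):
    """Signed distance-like score of x under the linear model (w, b)."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def train_linear_svm(xs, ys, lam=0.01, lr=0.01, epochs=20, seed=0):
    """Linear SVM via stochastic subgradient descent on L2-regularized hinge loss."""
    rng = random.Random(seed)
    w, b = [0.0] * len(xs[0]), 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            margin = ys[i] * decision(w, b, xs[i])
            w = [wj * (1 - lr * lam) for wj in w]  # L2 shrinkage step
            if margin < 1:  # hinge loss is active: move toward the correct side
                w = [wj + lr * ys[i] * xj for wj, xj in zip(w, xs[i])]
                b += lr * ys[i]
    return w, b

def accuracy(w, b, xs, ys):
    """Fraction of tokens whose predicted sign matches the label."""
    hits = sum(1 for x, y in zip(xs, ys)
               if (1 if decision(w, b, x) >= 0 else -1) == y)
    return hits / len(xs)

def evaluate(tokens, dims, n_train=300):
    """Train and test on the chosen feature dimensions; return held-out accuracy."""
    xs = [[feat[d] for d in dims] for feat, _ in tokens]
    ys = [label for _, label in tokens]
    train_x, test_x = standardize(xs[:n_train], xs[n_train:])
    w, b = train_linear_svm(train_x, ys[:n_train])
    return accuracy(w, b, test_x, ys[n_train:])

tokens = make_toy_tokens(200)
acc_f1 = evaluate(tokens, dims=[0])        # feature set without the front-back cue
acc_f1_f2 = evaluate(tokens, dims=[0, 1])  # feature set that includes F2
print(f"F1 only: {acc_f1:.2f}  F1+F2: {acc_f1_f2:.2f}")
```

In a real replication, the rows passed to `evaluate` would instead be per-token MFCC vectors, MFCCs with formants appended, or pooled CNN activations, with the same train/test split reused across feature sets so the accuracies are comparable.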
