AS LG SP Mar 12, 2024

Beyond the Labels: Unveiling Text-Dependency in Paralinguistic Speech Recognition Datasets

Jan Pešán, Santosh Kesiraju, Lukáš Burget, Jan ''Honza'' Černocký

arXiv:2403.07767v2 1.2 h-index: 10

Originality: Incremental advance

AI Analysis

This study highlights a critical flaw in the reliability of datasets used for paralinguistic speech research, one that could undermine progress in the field. The contribution is incremental because it builds on existing concerns about data integrity.

This paper investigates the assumption that machine learning models trained on paralinguistic speech datasets such as CLSE and IEMOCAP learn to identify traits such as cognitive load and emotion. It reveals significant text-dependency in the trait labels: models may capture lexical features instead, and large pre-trained models like HuBERT appear particularly prone to this issue.

Paralinguistic traits like cognitive load and emotion are increasingly recognized as pivotal areas in speech recognition research, often examined through specialized datasets like CLSE and IEMOCAP. However, the integrity of these datasets is seldom scrutinized for text-dependency. This paper critically evaluates the prevalent assumption that machine learning models trained on such datasets genuinely learn to identify paralinguistic traits, rather than merely capturing lexical features. By examining the lexical overlap in these datasets and testing the performance of machine learning models, we expose significant text-dependency in trait-labeling. Our results suggest that some machine learning models, especially large pre-trained models like HuBERT, might inadvertently focus on lexical characteristics rather than the intended paralinguistic features. The study serves as a call to action for the research community to reevaluate the reliability of existing datasets and methodologies, ensuring that machine learning models genuinely learn what they are designed to recognize.
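The kind of check the abstract describes can be illustrated with a minimal sketch. This is not the authors' method: the function names, the Jaccard measure, the transcript-lookup baseline, and the toy transcripts and labels below are all assumptions made for illustration. The idea is that if a predictor that sees only the transcript already recovers the labels, the labels are text-dependent.

```python
from collections import Counter, defaultdict


def jaccard(a, b):
    """Jaccard similarity between two collections of tokens."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def label_vocabularies(samples):
    """Collect the set of lowercase word tokens seen under each label."""
    vocab = defaultdict(set)
    for transcript, label in samples:
        vocab[label].update(transcript.lower().split())
    return dict(vocab)


def text_only_accuracy(train, test):
    """Accuracy of guessing the label from the transcript alone.

    Each test transcript is mapped to the label it most often carried in
    the training data; unseen transcripts fall back to the most frequent
    training label. No audio is used at any point.
    """
    by_text = defaultdict(Counter)
    for transcript, label in train:
        by_text[transcript.lower()][label] += 1
    fallback = Counter(label for _, label in train).most_common(1)[0][0]
    correct = 0
    for transcript, label in test:
        counts = by_text.get(transcript.lower())
        guess = counts.most_common(1)[0][0] if counts else fallback
        correct += guess == label
    return correct / len(test)


# Hypothetical toy data: (transcript, label) pairs, not taken from CLSE or IEMOCAP.
train = [
    ("count backwards from one hundred", "high_load"),
    ("read the sentence aloud", "low_load"),
    ("count backwards from one hundred", "high_load"),
    ("read the sentence aloud", "low_load"),
]
test = [
    ("count backwards from one hundred", "high_load"),
    ("read the sentence aloud", "low_load"),
]

vocabs = label_vocabularies(train)
overlap = jaccard(vocabs["high_load"], vocabs["low_load"])
accuracy = text_only_accuracy(train, test)
print(f"vocabulary overlap between labels: {overlap:.2f}")
print(f"text-only accuracy: {accuracy:.2f}")
```

In this toy setting the two labels share no vocabulary and the transcript alone predicts every test label, which is the pattern that would make it impossible to tell whether an audio model learned the paralinguistic trait or the words.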
