Beyond Words: Interjection Classification for Improved Human-Computer Interaction
This work addresses a gap in making human-computer dialogue more natural by classifying interjections, though the contribution is incremental because it builds on existing ASR and deep learning methods.
The paper tackles the problem that interjections such as 'mmm' and 'hmm' are ignored in human-computer interaction. It introduces a novel interjection classification task, a dataset, and a baseline deep learning model whose accuracy improves with data augmentation.
In human-computer interaction, fostering natural dialogue between humans and machines is paramount. A key but often overlooked component of this dialogue is the use of interjections such as "mmm" and "hmm". Although they are frequently used to express agreement, hesitation, or requests for information, these interjections are typically dismissed as "non-words" by Automatic Speech Recognition (ASR) engines. To address this gap, we introduce a novel task dedicated to interjection classification, the first of its kind to our knowledge. The task is challenging because interjection signals are short and exhibit significant inter- and intra-speaker variability. In this work, we present and release a dataset of interjection signals collected specifically for interjection classification. We use this dataset to train and evaluate a baseline deep learning model. To enhance performance, we augment the training data using techniques such as tempo and pitch transformation, which significantly improve classification accuracy and make the models more robust. The interjection dataset, a Python library for the augmentation pipeline, the baseline model, and the evaluation scripts are available to the research community.
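To make the tempo and pitch transformations concrete, the following is a minimal NumPy sketch and not the released augmentation library. The function names (`change_tempo`, `resample`, `shift_pitch`, `augment`) and all parameter values are hypothetical, chosen only for illustration. It assumes a mono waveform stored as a 1-D array and uses plain overlap-add time stretching, which is simple but produces audible artifacts. A production pipeline would more likely rely on a phase-vocoder or WSOLA implementation from an audio library.

```python
import numpy as np


def change_tempo(signal, rate, frame_len=1024, hop_out=256):
    """Time-stretch with plain overlap-add (OLA), without resampling.

    rate > 1 shortens the signal (faster), rate < 1 lengthens it (slower).
    frame_len and hop_out are illustrative values, not taken from the paper.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    signal = np.asarray(signal, dtype=np.float64)
    if len(signal) < frame_len:
        # Zero-pad very short clips so that at least one frame fits.
        signal = np.pad(signal, (0, frame_len - len(signal)))
    hop_in = hop_out * rate  # read frames at hop_in, write them at hop_out
    n_frames = int((len(signal) - frame_len) / hop_in) + 1
    window = np.hanning(frame_len)
    out = np.zeros((n_frames - 1) * hop_out + frame_len)
    norm = np.zeros_like(out)
    for i in range(n_frames):
        src = int(i * hop_in)
        dst = i * hop_out
        out[dst:dst + frame_len] += window * signal[src:src + frame_len]
        norm[dst:dst + frame_len] += window
    # Divide by the summed window weights to undo the overlap gain.
    return out / np.maximum(norm, 1e-8)


def resample(signal, factor):
    """Linear-interpolation resampling; factor > 1 shortens and raises pitch."""
    n_out = int(round(len(signal) / factor))
    positions = np.arange(n_out) * factor
    return np.interp(positions, np.arange(len(signal)), signal)


def shift_pitch(signal, semitones):
    """Shift pitch at roughly constant duration: stretch, then resample back."""
    factor = 2.0 ** (semitones / 12.0)
    stretched = change_tempo(signal, 1.0 / factor)
    return resample(stretched, factor)


def augment(signal, tempo_rates=(0.9, 1.1), pitch_steps=(-2, 2)):
    """Return perturbed copies of one recording (default values are assumptions)."""
    variants = [change_tempo(signal, r) for r in tempo_rates]
    variants += [shift_pitch(signal, s) for s in pitch_steps]
    return variants
```

With these defaults, each training recording yields four additional variants that keep the original class label. The rates and semitone steps would need tuning in practice, since a transformation that is too aggressive could change how an interjection is perceived.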