Quantifying Quanvolutional Neural Networks Robustness for Speech in Healthcare Applications

Ha Tran, Bipasha Kashyap, Pubudu N. Pathirana

arXiv:2601.02432v12.2

Originality: Synthesis-oriented

AI Analysis

This work addresses noise sensitivity in speech-based healthcare applications such as emotion recognition and voice pathology detection. It is incremental: it systematically compares existing quantum and classical models without introducing a new method.

The paper evaluates the robustness of quanvolutional neural networks (QNNs), a hybrid quantum machine learning model, against classical convolutional neural networks (CNNs) under acoustic corruptions in two speech tasks: emotion recognition and voice pathology detection. QNNs outperform a simple CNN baseline under pitch shift, temporal shift, and speed variation (up to 22% lower CE/RCE at severe temporal shift) but are less resilient to Gaussian noise. QNNs also converge up to six times faster than the baseline.

Speech-based machine learning systems are sensitive to noise, complicating reliable deployment in emotion recognition and voice pathology detection. We evaluate the robustness of a hybrid quantum machine learning model, quanvolutional neural networks (QNNs), against classical convolutional neural networks (CNNs) under four acoustic corruptions (Gaussian noise, pitch shift, temporal shift, and speed variation) in a clean-train/corrupted-test regime. Using AVFAD (voice pathology) and TESS (speech emotion), we compare three QNN models (Random, Basic, Strongly) to a simple CNN baseline (CNN-Base), ResNet-18, and VGG-16 using accuracy and corruption metrics (CE, mCE, RCE, RmCE), and analyze architectural factors (circuit complexity or depth, convergence) alongside per-emotion robustness. QNNs generally outperform CNN-Base under pitch shift, temporal shift, and speed variation (up to 22% lower CE/RCE at severe temporal shift), while CNN-Base remains more resilient to Gaussian noise. Among quantum circuits, QNN-Basic achieves the best overall robustness on AVFAD, and QNN-Random performs strongest on TESS. Emotion-wise, fear is most robust (80-90% accuracy under severe corruptions), neutral can collapse under strong Gaussian noise (5.5% accuracy), and happy is most vulnerable to pitch, temporal, and speed distortions. QNNs also converge up to six times faster than CNN-Base. To our knowledge, this is a systematic study of QNN robustness for speech under common non-adversarial acoustic corruptions, indicating that shallow entangling quantum front-ends can improve noise resilience while sensitivity to additive noise remains a challenge.
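
The abstract names the corruption metrics CE, mCE, RCE, and RmCE but does not define them. The sketch below assumes they follow the common corruption-benchmark convention: a model's error is summed over severity levels and normalised by a baseline's error, and the relative variants first subtract each model's clean error. CNN-Base is assumed to be the normalising baseline. The function names and every error rate shown are hypothetical illustrations, not values from the paper.

```python
from statistics import mean


def corruption_error(model_err, base_err):
    """CE for one corruption type: the model's error summed over severity
    levels, divided by the baseline's error summed over the same levels."""
    return sum(model_err) / sum(base_err)


def relative_corruption_error(model_err, model_clean, base_err, base_clean):
    """RCE for one corruption type: error increase over the clean test set,
    summed over severities and normalised by the baseline's increase."""
    model_drop = sum(e - model_clean for e in model_err)
    base_drop = sum(e - base_clean for e in base_err)
    return model_drop / base_drop


# Hypothetical error rates (fractions) at three severity levels.
# These numbers are made up for illustration only.
qnn_err = {
    "temporal_shift": [0.10, 0.20, 0.30],
    "gaussian_noise": [0.30, 0.50, 0.70],
}
cnn_base_err = {
    "temporal_shift": [0.20, 0.30, 0.50],
    "gaussian_noise": [0.20, 0.40, 0.40],
}
qnn_clean, cnn_base_clean = 0.05, 0.10  # assumed clean-test error rates

# Per-corruption CE and RCE of the (hypothetical) QNN against the baseline.
ce = {c: corruption_error(qnn_err[c], cnn_base_err[c]) for c in qnn_err}
rce = {
    c: relative_corruption_error(
        qnn_err[c], qnn_clean, cnn_base_err[c], cnn_base_clean
    )
    for c in qnn_err
}

# mCE and RmCE average the per-corruption values.
mce = mean(ce.values())
rmce = mean(rce.values())

for name in ce:
    print(f"{name}: CE={ce[name]:.3f}  RCE={rce[name]:.3f}")
print(f"mCE={mce:.3f}  RmCE={rmce:.3f}")
```

Under this convention, a value below 1 means the model degrades less than the baseline on that corruption, and a value above 1 means it degrades more. In the made-up data the temporal-shift values fall below 1 and the Gaussian-noise values above 1, mirroring the qualitative pattern the abstract reports.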
