MultiQT: Multimodal Learning for Real-Time Question Tracking in Speech
This work addresses a practical problem, decision support for emergency call-takers, but its contribution is incremental: it applies known multimodal techniques to a specific domain.
The paper tackles real-time question labeling in emergency medical service calls by proposing a multimodal approach that jointly learns from streamed audio and noisy transcriptions, showing significant gains over single-modality methods under adverse noise and limited data.
We address the challenging and practical task of labeling questions in speech in real time during telephone calls to emergency medical services in English, a task embedded within a broader decision support system for emergency call-takers. We propose a novel multimodal approach to real-time sequence labeling in speech. Our model treats speech and its own textual representation as two separate modalities, or views: it jointly learns from streamed audio and from the noisy text transcription produced by automatic speech recognition. Our results show significant gains from jointly learning from the two modalities, compared to learning from text or audio alone, under adverse noise and a limited volume of training data. The results generalize to medical symptom detection, where we observe a similar pattern of improvements with multimodal learning.
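To make the idea of "two views of the same speech" concrete, here is a minimal sketch of one common way to combine them: align the transcript with the audio on a shared frame grid, concatenate the two feature vectors at each frame, and score each frame with a linear labeler. This is an illustration under stated assumptions, not the paper's architecture. The function names (`upsample_tokens`, `fuse`, `label_frames`), the millisecond token timestamps, and the hand-picked weights are all hypothetical; a real system would learn the encoders and weights from data.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
# Hypothetical token record: (start_ms, end_ms, feature_vector)
Token = Tuple[int, int, Vector]


def upsample_tokens(tokens: Sequence[Token], n_frames: int,
                    frame_ms: int, dim: int) -> List[Vector]:
    """Spread token-level text features over fixed-length audio frames.

    Frames not covered by any token keep a zero vector, which stands in
    for "no transcript available at this time step".
    """
    frames = [[0.0] * dim for _ in range(n_frames)]
    for start_ms, end_ms, vec in tokens:
        first = start_ms // frame_ms
        last = min(n_frames, -(-end_ms // frame_ms))  # ceiling division
        for t in range(first, last):
            frames[t] = list(vec)
    return frames


def fuse(audio: Sequence[Vector], text: Sequence[Vector]) -> List[Vector]:
    """Concatenate the two modalities frame by frame."""
    if len(audio) != len(text):
        raise ValueError("audio and text must have the same number of frames")
    return [list(a) + list(t) for a, t in zip(audio, text)]


def label_frames(fused: Sequence[Vector], weights: Vector,
                 bias: float) -> List[int]:
    """Label a frame 1 when its linear score is positive, else 0."""
    return [
        int(sum(w * x for w, x in zip(weights, frame)) + bias > 0.0)
        for frame in fused
    ]


# Toy input: four 100 ms frames with 1-d audio features, plus one
# transcribed token spanning 100-300 ms with a 1-d text feature.
audio_frames = [[0.1], [0.9], [0.8], [0.2]]
asr_tokens = [(100, 300, [1.0])]

text_frames = upsample_tokens(asr_tokens, n_frames=4, frame_ms=100, dim=1)
fused_frames = fuse(audio_frames, text_frames)
labels = label_frames(fused_frames, weights=[1.0, 1.0], bias=-1.0)
print(labels)  # [0, 1, 1, 0]
```

Because the labeler works frame by frame, it can emit labels as audio arrives, which is the property a real-time setting needs. Concatenation is only the simplest fusion choice; it is used here for clarity and is not claimed to be the method evaluated in the paper.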