Detecting Escalation Level from Speech with Transfer Learning and Acoustic-Lexical Information Fusion
This work aims to improve safety and order in public spaces such as airports and train stations by detecting conversational escalation. The contribution is incremental: it builds on existing methods, combining transfer learning with feature fusion.
The paper addresses the detection of escalation levels from speech in public areas with a system that fuses acoustic and lexical features and uses transfer learning from emotional speech datasets. The system achieves 81.5% unweighted average recall (UAR) on the development set, significantly outperforming a baseline of 72.2%.
Textual escalation detection has been widely applied in the customer service systems of e-commerce companies to pre-alert and prevent potential conflicts. Similarly, in public areas such as airports and train stations, where many impersonal conversations frequently take place, acoustic-based escalation detection systems are also useful to enhance passengers' safety and maintain public order. To this end, we introduce a system based on acoustic-lexical features to detect escalation from speech. Voice Activity Detection (VAD) and label smoothing are adopted to further enhance the performance in our experiments. Because only a small set of training and development data is available, we also employ transfer learning on several well-known emotion detection datasets, i.e., RAVDESS and CREMA-D, to learn advanced emotional representations that are then applied to the conversational escalation detection task. On the development set, our proposed system achieves 81.5% unweighted average recall (UAR), which significantly outperforms the baseline with 72.2% UAR.
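
To make the evaluation metric and the label smoothing step concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes integer class labels; the function names, the three-class toy data, and the smoothing factor of 0.1 are hypothetical choices made for illustration only. UAR is the mean of the per-class recalls, so every class counts equally regardless of how many samples it has, and label smoothing replaces a one-hot target with a slightly softened distribution.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls; each class counts equally regardless of size."""
    total = defaultdict(int)    # number of reference samples per class
    correct = defaultdict(int)  # number of correctly predicted samples per class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


def smooth_labels(label, num_classes, epsilon=0.1):
    """Soft target: (1 - epsilon) on the true class plus epsilon spread uniformly.

    epsilon=0.1 is an assumed value, not one taken from the paper.
    """
    base = epsilon / num_classes
    target = [base] * num_classes
    target[label] += 1.0 - epsilon
    return target


# Toy example with three hypothetical classes and an imbalanced class 0.
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 1, 0, 2, 2]

uar = unweighted_average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
soft_target = smooth_labels(1, num_classes=3)

print(f"UAR: {uar:.4f}, accuracy: {accuracy:.4f}")
print("smoothed target for class 1:", soft_target)
```

In this toy data the single error falls on a minority class, so UAR comes out lower than plain accuracy. That sensitivity to minority classes is why UAR is a common choice when class sizes are unbalanced.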