CL · LG · SD · AS · Jun 11, 2024

Multimodal Belief Prediction

arXiv:2406.07466v1 · 2 citations
Originality: Incremental advance
AI Analysis

This paper addresses belief prediction in NLP by adding audio cues to text, a domain-specific advance over existing text-only methods.

The paper tackles the problem of predicting a speaker's belief commitment by introducing a multimodal approach that combines text and audio, improving over text-only and audio-only baselines.

Recognizing a speaker's level of commitment to a belief is a difficult task; humans do not only interpret the meaning of the words in context, but also understand cues from intonation and other aspects of the audio signal. Many papers and corpora in the NLP community have approached the belief prediction task using text-only approaches. We are the first to frame and present results on the multimodal belief prediction task. We use the CB-Prosody corpus (CBP), containing aligned text and audio with speaker belief annotations. We first report baselines and significant features using acoustic-prosodic features and traditional machine learning methods. We then present text and audio baselines for the CBP corpus fine-tuning on BERT and Whisper respectively. Finally, we present our multimodal architecture which fine-tunes on BERT and Whisper and uses multiple fusion methods, improving on both modalities alone.
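The abstract mentions "multiple fusion methods" without naming them. As a rough illustration only, the sketch below contrasts two common strategies, early fusion (concatenating text and audio embeddings before a single prediction head) and late fusion (combining per-modality predictions). Whether these match the paper's methods is an assumption. The function names, the toy embeddings, and the linear head are hypothetical stand-ins for fine-tuned BERT and Whisper encoders, and are not taken from the paper or its code.

```python
from typing import List, Sequence


def early_fusion(text_emb: Sequence[float], audio_emb: Sequence[float]) -> List[float]:
    """Early fusion: concatenate the two embeddings into one joint feature vector."""
    return list(text_emb) + list(audio_emb)


def linear_head(features: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Hypothetical regression head: a single linear layer producing a belief score."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return sum(f * w for f, w in zip(features, weights)) + bias


def late_fusion(text_score: float, audio_score: float, text_weight: float = 0.5) -> float:
    """Late fusion: weighted average of the per-modality predictions."""
    return text_weight * text_score + (1.0 - text_weight) * audio_score


# Toy embeddings standing in for encoder outputs (assumed values, not real data).
text_emb = [0.2, -0.1, 0.4]
audio_emb = [0.5, 0.3]

# Early fusion: one head over the concatenated vector.
joint = early_fusion(text_emb, audio_emb)
early_score = linear_head(joint, [1.0] * len(joint), 0.0)

# Late fusion: one head per modality, then combine the two scores.
text_score = linear_head(text_emb, [1.0] * len(text_emb), 0.0)
audio_score = linear_head(audio_emb, [1.0] * len(audio_emb), 0.0)
late_score = late_fusion(text_score, audio_score)
```

In a real system the heads would be trained jointly with the encoders; the sketch only shows where the two modalities meet in each strategy.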

Code Implementations: 1 repo
