Seeing is Believing: Emotion-Aware Audio-Visual Language Modeling for Expressive Speech Generation
This work addresses the challenge of building more natural, emotionally aware conversational systems. The contribution is incremental, since it builds on an existing pre-trained expressive speech model.
The paper tackles expressive speech generation by integrating full-face visual cues into a pre-trained expressive speech model. After fine-tuning on emotion recognition and expressive dialogue tasks, it reports substantial gains over speech-only baselines (e.g., +5 F1 in emotion recognition).
We present an Audio-Visual Language Model (AVLM) for expressive speech generation by integrating full-face visual cues into a pre-trained expressive speech model. We explore multiple visual encoders and multimodal fusion strategies during pre-training to identify the most effective integration approach. Subsequent fine-tuning on emotion recognition and expressive dialogue tasks yields substantial gains over speech-only baselines (e.g., +5 F1 in emotion recognition). AVLM highlights the value of expressive visual information in guiding speech generation and offers a foundation for end-to-end multimodal conversational systems.
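To make the notion of a "multimodal fusion strategy" concrete, the following is a minimal sketch of two generic ways to combine time-aligned audio and visual frame features: concatenation and weighted addition. The abstract does not say which strategies AVLM compares, so the function names (`fuse_concat`, `fuse_add`), the `visual_weight` parameter, and the toy feature values are all hypothetical and are not taken from the paper.

```python
from typing import List

Vector = List[float]


def fuse_concat(audio: List[Vector], visual: List[Vector]) -> List[Vector]:
    """Hypothetical fusion: concatenate audio and visual features per frame."""
    if len(audio) != len(visual):
        raise ValueError("streams must have the same number of frames")
    # List concatenation widens each frame to len(a) + len(v) dimensions.
    return [a + v for a, v in zip(audio, visual)]


def fuse_add(audio: List[Vector], visual: List[Vector],
             visual_weight: float = 1.0) -> List[Vector]:
    """Hypothetical fusion: add weighted visual features to audio features."""
    if len(audio) != len(visual):
        raise ValueError("streams must have the same number of frames")
    fused = []
    for a, v in zip(audio, visual):
        if len(a) != len(v):
            raise ValueError("additive fusion needs equal feature sizes")
        # Element-wise sum keeps the original feature dimension.
        fused.append([x + visual_weight * y for x, y in zip(a, v)])
    return fused


# Toy example: two frames, two feature dimensions per modality (made-up values).
audio_frames = [[0.1, 0.2], [0.3, 0.4]]
visual_frames = [[1.0, 0.0], [0.0, 1.0]]

concat = fuse_concat(audio_frames, visual_frames)
added = fuse_add(audio_frames, visual_frames, visual_weight=0.5)
```

Concatenation preserves both streams unchanged but widens the input, so the downstream model would need a matching input size. Additive fusion keeps the dimension of the speech features, which may be convenient when adapting a pre-trained speech model, at the cost of mixing the two signals.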