CV, LG · Apr 1, 2025

SViQA: A Unified Speech-Vision Multimodal Model for Textless Visual Question Answering

arXiv:2504.01049v1 · 1 citation · ICIC

Originality: Incremental advance

AI Analysis

The work addresses a modality gap in human-computer interaction for applications that require direct audio-visual understanding. The advance is incremental, since the model builds on the LLaVA architecture.

The paper tackles Speech-Based Visual Question Answering (SBVQA) by introducing SViQA, a unified speech-vision model that processes spoken questions directly, without text transcription. It reports state-of-the-art performance on the SBVQA benchmark with 75.62% accuracy, rising to 78.85% with speech-text mixed input.

Multimodal models integrating speech and vision hold significant potential for advancing human-computer interaction, particularly in Speech-Based Visual Question Answering (SBVQA) where spoken questions about images require direct audio-visual understanding. Existing approaches predominantly focus on text-visual integration, leaving speech-visual modality gaps underexplored due to their inherent heterogeneity. To this end, we introduce SViQA, a unified speech-vision model that directly processes spoken questions without text transcription. Building upon the LLaVA architecture, our framework bridges auditory and visual modalities through two key innovations: (1) end-to-end speech feature extraction eliminating intermediate text conversion, and (2) cross-modal alignment optimization enabling effective fusion of speech signals with visual content. Extensive experimental results on the SBVQA benchmark demonstrate the proposed SViQA's state-of-the-art performance, achieving 75.62% accuracy, and competitive multimodal generalization. Leveraging speech-text mixed input boosts performance to 78.85%, a 3.23 percentage-point improvement over pure speech input, highlighting SViQA's enhanced robustness and effective cross-modal attention alignment.
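The reported gain from speech-text mixed input can be read either as an absolute difference in accuracy or as a relative change over the speech-only baseline. The short sketch below works through both readings using only the two accuracy figures quoted in the abstract; the function and variable names are illustrative assumptions, not anything from the paper.

```python
def improvement(baseline, improved):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute_points = improved - baseline
    relative_percent = 100.0 * absolute_points / baseline
    return absolute_points, relative_percent


# Accuracy figures as quoted in the abstract (percent).
speech_only = 75.62
speech_text_mixed = 78.85

abs_pts, rel_pct = improvement(speech_only, speech_text_mixed)
print(f"absolute gain: {abs_pts:.2f} percentage points")
print(f"relative gain: {rel_pct:.2f}% over the speech-only baseline")
```

The 3.23 figure in the abstract corresponds to the absolute difference; the relative gain over the speech-only result is somewhat larger.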
