CV | SD | AS | Jan 3, 2025

VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

arXiv:2501.01957v4 | 188 citations | h-index: 24 | Has Code
Originality: Incremental advance
AI Analysis

This paper addresses the problem of efficient multimodal interaction for applications requiring real-time vision and speech, though it appears incremental because it builds on existing MLLM frameworks.

The paper tackles the challenge of integrating speech with vision and language in multimodal dialogue systems, proposing a multi-stage training method that achieves near real-time vision and speech interaction while preserving strong vision-language capacity and accelerating response speed.

Recent Multimodal Large Language Models (MLLMs) have typically focused on integrating visual and textual modalities, with less emphasis placed on the role of speech in enhancing interaction. However, speech plays a crucial role in multimodal dialogue systems, and achieving high performance in both vision and speech tasks remains a significant challenge due to the fundamental modality differences. In this paper, we propose a carefully designed multi-stage training methodology that progressively trains the LLM to understand both visual and speech information, ultimately enabling fluent vision and speech interaction. Our approach not only preserves strong vision-language capacity, but also enables efficient speech-to-speech dialogue capabilities without separate ASR and TTS modules, significantly accelerating multimodal end-to-end response speed. By comparing our method against state-of-the-art counterparts across benchmarks for image, video, and speech tasks, we demonstrate that our model is equipped with both strong visual and speech capabilities, enabling near real-time vision and speech interaction. Code has been released at https://github.com/VITA-MLLM/VITA.

Code Implementations: 1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
