Everything is a Video: Unifying Modalities through Next-Frame Prediction
The approach simplifies multimodal model design for tasks such as visual question answering and cross-modal retrieval, and may lay groundwork for more general foundation models. Its novelty appears incremental, however, since it extends task reformulation from NLP to multimodal settings.
The paper addresses scalability and flexibility in multimodal learning by reformulating diverse tasks as a single next-frame prediction problem. This lets one model handle text, images, audio, and video without modality-specific components, and the authors report generalization across tasks with minimal adaptation.
Multimodal learning, which involves integrating information from various modalities such as text, images, audio, and video, is pivotal for numerous complex tasks like visual question answering, cross-modal retrieval, and caption generation. Traditional approaches rely on modality-specific encoders and late fusion techniques, which can hinder scalability and flexibility when adapting to new tasks or modalities. To address these limitations, we introduce a novel framework that extends the concept of task reformulation beyond natural language processing (NLP) to multimodal learning. We propose to reformulate diverse multimodal tasks into a unified next-frame prediction problem, allowing a single model to handle different modalities without modality-specific components. This method treats all inputs and outputs as sequential frames in a video, enabling seamless integration of modalities and effective knowledge transfer across tasks. Our approach is evaluated on a range of tasks, including text-to-text, image-to-text, video-to-video, video-to-text, and audio-to-text, demonstrating the model's ability to generalize across modalities with minimal adaptation. We show that task reformulation can significantly simplify multimodal model design across various tasks, laying the groundwork for more generalized multimodal foundation models.
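To make the reformulation concrete, the sketch below shows one way an image-to-text task could be laid out as a single frame sequence with next-frame prediction targets. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not specify how text becomes frames, how inputs are separated from outputs, or what frame size is used. The 8x8 frame size, the bit-pattern text encoding, the all-white separator frame, and every function name here are hypothetical.

```python
from typing import List, Tuple

# One grayscale frame: H rows of W pixel values in [0, 1].
Frame = List[List[float]]

H, W = 8, 8  # hypothetical tiny frame size, chosen only for illustration


def blank_frame(value: float = 0.0) -> Frame:
    """Return an H x W frame filled with a constant value."""
    return [[value] * W for _ in range(H)]


def text_to_frames(text: str) -> List[Frame]:
    """Assumed encoding: one frame per character, with the low 8 bits of
    the character code written as 0/1 pixels along the top row."""
    frames = []
    for ch in text:
        frame = blank_frame()
        code = ord(ch) & 0xFF
        for bit in range(8):
            frame[0][bit] = float((code >> (7 - bit)) & 1)
        frames.append(frame)
    return frames


def image_to_frames(image: Frame) -> List[Frame]:
    """A still image is treated as a one-frame video."""
    return [image]


def build_sequence(inputs: List[Frame],
                   targets: List[Frame]) -> Tuple[List[Frame], int]:
    """Concatenate input frames, a separator frame, and target frames.

    Returns the full sequence and the index of the first target frame.
    The all-white separator is an assumption of this sketch.
    """
    separator = blank_frame(1.0)
    return inputs + [separator] + targets, len(inputs) + 1


def next_frame_pairs(sequence: List[Frame],
                     first_target: int) -> List[Tuple[List[Frame], Frame]]:
    """Build (context frames, next frame) pairs for each target position."""
    return [(sequence[:t], sequence[t])
            for t in range(first_target, len(sequence))]


# Usage: an image-captioning example expressed as next-frame prediction.
image = blank_frame(0.5)  # stand-in for a real image
sequence, first_target = build_sequence(image_to_frames(image),
                                        text_to_frames("cat"))
pairs = next_frame_pairs(sequence, first_target)
```

Under these assumptions, any task in the abstract's list differs only in which converter produces the input and target frames. The sequence layout and the prediction objective stay the same, which is the sense in which a single model needs no modality-specific components.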