CV · AI · CL · Jun 12, 2023

Valley: Video Assistant with Large Language model Enhanced abilitY

arXiv:2306.07207v3 · 268 citations · h-index: 22 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses the need for improved AI assistants in video-text tasks, but it is incremental as it builds on existing LLM and vision encoder methods.

The authors tackle joint video and language understanding by introducing Valley, a multi-modal foundation model designed to improve video comprehension and instruction following. Their experiments suggest that it has potential as an effective video assistant for complex video-understanding scenarios.

Large Language Models (LLMs), with remarkable conversational capability, have emerged as AI assistants that can handle both visual and textual modalities. However, their effectiveness in joint video and language understanding has not been extensively explored. In this paper, we introduce Valley, a multi-modal foundation model designed to enable enhanced video comprehension and instruction-following capabilities. To this end, we construct two datasets, namely Valley-702k and Valley-instruct-73k, to cover a diverse range of video-text alignment and video-based instruction tasks, such as multi-shot captions, long video descriptions, action recognition, and causal inference. We then adopt ViT-L/14 as the vision encoder and explore three different temporal modeling modules to learn multifaceted features for enhanced video understanding. In addition, we implement a two-phase training approach for Valley: the first phase focuses solely on training the projection module to facilitate the LLM's capacity to understand visual input, and the second phase jointly trains the projection module and the LLM to improve their instruction-following ability. Extensive experiments demonstrate that Valley has the potential to serve as an effective video assistant, simplifying complex video-understanding scenarios. Our code and data are published anonymously at https://github.com/valley-vl/Valley.
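
The two-phase training schedule described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the component names and the `trainable_flags` helper are hypothetical, and keeping the vision encoder frozen in both phases is an assumption, since the abstract only says which parts are trained.

```python
# Minimal sketch of the two-phase schedule from the abstract.
# Component names are hypothetical labels, not identifiers from the Valley code.
COMPONENTS = ("vision_encoder", "projection", "llm")

# Phase 1: only the projection module is trained.
# Phase 2: the projection module and the LLM are trained jointly.
# Assumption: the vision encoder stays frozen in both phases.
TRAINABLE_BY_PHASE = {
    1: {"projection"},
    2: {"projection", "llm"},
}


def trainable_flags(phase):
    """Return a mapping of component name to whether it is trained in `phase`."""
    if phase not in TRAINABLE_BY_PHASE:
        raise ValueError(f"unknown training phase: {phase}")
    trainable = TRAINABLE_BY_PHASE[phase]
    return {name: name in trainable for name in COMPONENTS}


if __name__ == "__main__":
    for phase in (1, 2):
        print(phase, trainable_flags(phase))
```

In a real training framework, flags like these would typically decide which parameters receive gradient updates in each phase.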

Code Implementations: 1 repo