CV AI | Nov 22, 2023

PG-Video-LLaVA: Pixel Grounding Large Video-Language Models

Shehan Munasinghe, Rusiru Thushara, Muhammad Maaz, Hanoona Abdul Rasheed, Salman Khan, Mubarak Shah, Fahad Khan

arXiv:2311.13435v222.654 citations | h-index: 22 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses video understanding by enabling spatial localization of objects in videos, though it builds incrementally on existing image-based models.

The authors extend image-based Large Multimodal Models (LMMs) to videos with PG-Video-LLaVA, which they describe as the first LMM with pixel-level grounding capability. The model also integrates audio cues by transcribing them into text, and the authors report promising gains on video-based conversation and grounding tasks.

Extending image-based Large Multimodal Models (LMMs) to videos is challenging due to the inherent complexity of video data. Recent approaches that extend image-based LMMs to videos either lack grounding capabilities (e.g., VideoChat, Video-ChatGPT, Video-LLaMA) or do not utilize audio signals for better video understanding (e.g., Video-ChatGPT). Addressing these gaps, we propose PG-Video-LLaVA, the first LMM with pixel-level grounding capability, integrating audio cues by transcribing them into text to enrich video-context understanding. Our framework uses an off-the-shelf tracker and a novel grounding module, enabling it to spatially localize objects in videos following user instructions. We evaluate PG-Video-LLaVA using video-based generative and question-answering benchmarks and introduce new benchmarks specifically designed to measure prompt-based object grounding performance in videos. Further, we propose the use of Vicuna instead of GPT-3.5 (as utilized in Video-ChatGPT) for video-based conversation benchmarking, ensuring reproducibility of results, which is a concern given the proprietary nature of GPT-3.5. Our framework builds on the SoTA image-based LLaVA model and extends its advantages to the video domain, delivering promising gains on video-based conversation and grounding tasks. Project Page: https://github.com/mbzuai-oryx/Video-LLaVA
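
The abstract says audio cues are transcribed into text to enrich the video context, but it does not give the prompt format. As a rough illustration only, the sketch below shows one way a transcript could be merged with a user question before it reaches a language model. The function name `build_prompt` and the "Audio transcript:" / "Question:" labels are hypothetical and are not taken from the paper or its code.

```python
from typing import Optional


def build_prompt(question: str, transcript: Optional[str] = None) -> str:
    """Combine an optional audio transcript with the user's question.

    Hypothetical helper: the labels and layout are assumptions made for
    illustration, not the format used by PG-Video-LLaVA.
    """
    parts = []
    # Include the transcript only when it contains non-whitespace text.
    if transcript and transcript.strip():
        parts.append(f"Audio transcript: {transcript.strip()}")
    parts.append(f"Question: {question.strip()}")
    return "\n".join(parts)


if __name__ == "__main__":
    print(build_prompt("What is the person doing?", "Let me show you how to tune a guitar."))
```

When no transcript is available (for example, a silent clip), the helper falls back to the question alone, so the same call site works for both cases.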
