CV, CL · May 4, 2023

VideoOFA: Two-Stage Pre-Training for Video-to-Text Generation

arXiv:2305.03204v1 · 4 citations
Originality: Incremental advance
AI Analysis

This work addresses the problem of generating accurate text from videos for applications like captioning and question answering, representing an incremental advance in pre-training methods for video-language models.

The authors propose a two-stage pre-training framework for video-to-text generation: the model first learns vision-language concepts from image-text data, then adapts to video data. It sets a new state of the art on four video captioning benchmarks, improving CIDEr by an average of 9.7 points over prior work, and outperforms existing models on two open-ended video question answering datasets.

We propose a new two-stage pre-training framework for video-to-text generation tasks such as video captioning and video question answering: A generative encoder-decoder model is first jointly pre-trained on massive image-text data to learn fundamental vision-language concepts, and then adapted to video data in an intermediate video-text pre-training stage to learn video-specific skills such as spatio-temporal reasoning. As a result, our VideoOFA model achieves new state-of-the-art performance on four Video Captioning benchmarks, beating prior art by an average of 9.7 points in CIDEr score. It also outperforms existing models on two open-ended Video Question Answering datasets, showcasing its generalization capability as a universal video-to-text model.
