Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding
This work addresses the need for more accurate and versatile video understanding models in AI and computer vision applications, and represents a strong incremental advance.
The paper tackles detailed video description and comprehensive video understanding with Tarsier2, a large vision-language model. Tarsier2-7B outperforms leading proprietary models such as GPT-4o and Gemini 1.5 Pro by up to 5.8% in F1 on the DREAM-1K benchmark and sets new state-of-the-art results across 15 public benchmarks.
We introduce Tarsier2, a state-of-the-art large vision-language model (LVLM) designed for generating detailed and accurate video descriptions, while also exhibiting superior general video understanding capabilities. Tarsier2 achieves significant advancements through three key upgrades: (1) Scaling pre-training data from 11M to 40M video-text pairs, enriching both volume and diversity; (2) Performing fine-grained temporal alignment during supervised fine-tuning; (3) Using model-based sampling to automatically construct preference data and applying DPO training for optimization. Extensive experiments show that Tarsier2-7B consistently outperforms leading proprietary models, including GPT-4o and Gemini 1.5 Pro, in detailed video description tasks. On the DREAM-1K benchmark, Tarsier2-7B improves F1 by 2.8% over GPT-4o and 5.8% over Gemini 1.5 Pro. In human side-by-side evaluations, Tarsier2-7B shows a +8.6% performance advantage over GPT-4o and +24.9% over Gemini 1.5 Pro. Tarsier2-7B also sets new state-of-the-art results across 15 public benchmarks, spanning tasks such as video question-answering, video grounding, hallucination tests, and embodied question-answering, demonstrating its versatility as a robust generalist vision-language model.
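Upgrade (3) relies on DPO (Direct Preference Optimization). The abstract does not spell out the loss, so the following is only a minimal sketch of the standard DPO objective for a single preference pair, not a description of Tarsier2's actual training code. The function name `dpo_loss`, the default `beta` of 0.1, and the example log-probabilities are assumptions made for illustration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) description pair.

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    `beta` is a hypothetical default, not a value from the paper.
    """
    # How much more (or less) likely each response is under the policy
    # than under the reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Scaled preference margin; positive when the policy favors the
    # chosen response more strongly than the reference does.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Illustrative values only: the policy has shifted toward the chosen response.
loss = dpo_loss(-5.0, -9.0, -6.0, -8.0)
```

When the policy and reference agree exactly, the margin is zero and the loss equals log 2. The loss falls below that value as the policy comes to prefer the chosen description over the rejected one more strongly than the reference model does.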