CV | LG | Apr 17, 2025

VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models

arXiv:2504.13122v1 | 13 citations | h-index: 27 | Has Code | ICML
Originality: Incremental advance
AI Analysis

This paper addresses video-language misalignment in video understanding tasks. It offers a novel method, though the advance is likely incremental within preference optimization for videos.

The paper tackles misalignment and hallucination in Large Video Models by introducing VistaDPO, a framework for hierarchical spatial-temporal direct preference optimization. Experiments report significant gains on benchmarks such as Video Hallucination and Video QA.

Large Video Models (LVMs) built upon Large Language Models (LLMs) have shown promise in video understanding but often suffer from misalignment with human intuition and video hallucination issues. To address these challenges, we introduce VistaDPO, a novel framework for Video Hierarchical Spatial-Temporal Direct Preference Optimization. VistaDPO enhances text-video preference alignment across three hierarchical levels: i) Instance Level, aligning overall video content with responses; ii) Temporal Level, aligning video temporal semantics with event descriptions; and iii) Perceptive Level, aligning spatial objects with language tokens. Given the lack of datasets for fine-grained video-language preference alignment, we construct VistaDPO-7k, a dataset of 7.2K QA pairs annotated with chosen and rejected responses, along with spatial-temporal grounding information such as timestamps, keyframes, and bounding boxes. Extensive experiments on benchmarks such as Video Hallucination, Video QA, and Captioning performance tasks demonstrate that VistaDPO significantly improves the performance of existing LVMs, effectively mitigating video-language misalignment and hallucination. The code and data are available at https://github.com/HaroldChen19/VistaDPO.
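The abstract does not spell out the training objective, so the following is only a minimal sketch of how a three-level preference loss might be assembled. It uses the standard DPO loss (negative log-sigmoid of the scaled log-ratio margin against a frozen reference model) and assumes, purely for illustration, that the instance, temporal, and perceptive levels are combined as a weighted sum. The function names, the `weights` argument, and the example log-probabilities are hypothetical and are not taken from the VistaDPO code.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair.

    Inputs are sequence log-probabilities under the trained policy and
    under a frozen reference model.
    """
    margin = beta * ((policy_chosen - ref_chosen)
                     - (policy_rejected - ref_rejected))
    # -log(sigmoid(margin)) equals softplus(-margin)
    return softplus(-margin)


def hierarchical_dpo_loss(levels: dict, weights: dict, beta: float = 0.1) -> float:
    """Weighted sum of per-level DPO losses.

    ASSUMPTION: the weighted-sum combination and the level names are
    illustrative; the paper may combine its levels differently.
    `levels` maps a level name to a tuple
    (policy_chosen, policy_rejected, ref_chosen, ref_rejected).
    """
    return sum(weights[name] * dpo_loss(*logps, beta=beta)
               for name, logps in levels.items())


# Hypothetical log-probabilities for a single training sample.
example_levels = {
    "instance":   (-12.0, -15.0, -13.0, -14.0),
    "temporal":   (-8.0,  -9.5,  -8.5,  -9.0),
    "perceptive": (-5.0,  -5.0,  -5.0,  -5.0),
}
example_weights = {"instance": 1.0, "temporal": 0.5, "perceptive": 0.5}
total = hierarchical_dpo_loss(example_levels, example_weights)
```

When the policy and reference agree exactly (as in the `perceptive` entry above), the margin is zero and that level contributes `log(2)` times its weight; the loss drops below that as the policy favors the chosen response more strongly than the reference does.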

Code Implementations: 1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
