CV · Jun 17, 2024

ISR-DPO: Aligning Large Multimodal Models for Videos by Iterative Self-Retrospective DPO

arXiv:2406.11280v2 · 2 citations
AI Analysis

This paper addresses a key challenge in aligning multimodal models for video tasks and offers a new method to improve performance, though it appears incremental, extending existing preference optimization techniques.

The paper tackles modality misalignment and visual hallucinations that arise in Video Large Multi-modal Models (VLMMs) during iterative preference optimization. It proposes ISR-DPO to improve visual grounding and reports significantly outperforming state-of-the-art methods on video question answering benchmarks.

Iterative self-improvement, a concept extending beyond personal growth, has found powerful applications in machine learning, particularly in transforming weak models into strong ones. While recent advances in natural language processing have shown its efficacy through iterative preference optimization, applying this approach to Video Large Multi-modal Models (VLMMs) remains challenging due to modality misalignment. VLMMs struggle with this misalignment during iterative preference modeling, as the self-judge model often prioritizes linguistic knowledge over visual information. Additionally, iterative preference optimization can lead to visually hallucinated verbose responses due to length bias within the self-rewarding cycle. To address these issues, we propose Iterative Self-Retrospective Direct Preference Optimization (ISR-DPO), a method that uses self-retrospection to enhance preference modeling. This approach enhances the self-judge's focus on informative video regions, resulting in more visually grounded preferences. In extensive empirical evaluations across diverse video question answering benchmarks, ISR-DPO significantly outperforms the state of the art. We are committed to open-sourcing our code, models, and datasets to encourage further investigation.
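
The abstract does not spell out the training objective. As a rough illustration only, and assuming ISR-DPO builds on the standard Direct Preference Optimization loss that its name refers to, the per-pair loss can be sketched as below. The function names and the `beta` default are illustrative choices, not taken from the paper, and the self-retrospective judging step that produces the chosen and rejected responses is not shown.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    if x > 0:
        return x + math.log1p(math.exp(-x))
    return math.log1p(math.exp(x))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair (hypothetical helper).

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    beta=0.1 is an assumed default, not a value from the paper.
    """
    # How much more likely the policy makes each response than the reference.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-margin)
```

In an iterative setup of the kind the abstract describes, the chosen and rejected responses for each round would come from the model's own judgments, and the loss falls as the policy raises the chosen response's likelihood relative to the rejected one.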

Code Implementations: 4 repos