CV · Mar 30, 2025

ReferDINO-Plus: 2nd Solution for 4th PVUW MeViS Challenge at CVPR 2025

arXiv:2503.23509v2 · 1 citation · h-index: 7 · Has Code
Originality: Synthesis-oriented
AI Analysis

This work addresses referring video object segmentation (RVOS) for applications such as video editing and human-agent interaction. It is incremental: it builds on existing methods (ReferDINO and SAM2) with minor improvements.

The paper tackles referring video object segmentation by enhancing ReferDINO with SAM2 for better mask quality and object consistency, and introduces a conditional mask fusion strategy to balance single- and multi-object scenarios, achieving 60.43 J&F on the MeViS test set and securing 2nd place in the challenge.

Referring Video Object Segmentation (RVOS) aims to segment target objects throughout a video based on a text description. This task has attracted increasing attention in the field of computer vision due to its promising applications in video editing and human-agent interaction. Recently, ReferDINO has demonstrated promising performance in this task by adapting object-level vision-language knowledge from pretrained foundational image models. In this report, we further enhance its capabilities by incorporating the advantages of SAM2 in mask quality and object consistency. In addition, to effectively balance performance between single-object and multi-object scenarios, we introduce a conditional mask fusion strategy that adaptively fuses the masks from ReferDINO and SAM2. Our solution, termed ReferDINO-Plus, achieves 60.43 \(\mathcal{J}\&\mathcal{F}\) on the MeViS test set, securing 2nd place in the MeViS PVUW challenge at CVPR 2025. The code is available at: https://github.com/iSEE-Laboratory/ReferDINO-Plus.
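The abstract states only that the fusion is conditional and adaptive. It does not give the rule. As a rough illustration, the sketch below assumes one plausible condition: when the ReferDINO mask covers much more area than the SAM2 mask, SAM2 has probably dropped some of the referred objects, so the two masks are unioned; otherwise the SAM2 mask is kept for its quality. The function name `conditional_mask_fusion`, the area-ratio test, and the threshold `1.5` are all assumptions made for this example, not details taken from the report or its repository.

```python
import numpy as np


def conditional_mask_fusion(referdino_mask, sam2_mask, area_ratio_threshold=1.5):
    """Fuse two binary masks for a single frame (hypothetical rule).

    referdino_mask, sam2_mask: arrays of the same shape, truthy = foreground.
    area_ratio_threshold: assumed hyperparameter, not from the source.
    """
    referdino_mask = np.asarray(referdino_mask, dtype=bool)
    sam2_mask = np.asarray(sam2_mask, dtype=bool)
    if referdino_mask.shape != sam2_mask.shape:
        raise ValueError("masks must have the same shape")

    area_referdino = int(referdino_mask.sum())
    area_sam2 = int(sam2_mask.sum())

    # Assumed condition: a much larger ReferDINO mask suggests a
    # multi-object target that SAM2 only partially tracked.
    if area_referdino > area_ratio_threshold * area_sam2:
        return referdino_mask | sam2_mask

    # Otherwise treat it as the single-object case and trust SAM2.
    return sam2_mask


# Usage: ReferDINO finds two objects, SAM2 tracks only one of them.
referdino = np.zeros((4, 4), dtype=bool)
referdino[0, 0:2] = True
referdino[3, 2:4] = True
sam2 = np.zeros((4, 4), dtype=bool)
sam2[0, 0:2] = True
fused = conditional_mask_fusion(referdino, sam2)
```

In this toy case the ReferDINO area (4 pixels) exceeds 1.5 times the SAM2 area (2 pixels), so the fused mask contains both objects. A real pipeline would apply such a rule per frame or per video, and the actual criterion should be checked against the linked repository.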

Code Implementations: 1 repo
