CV · Apr 13, 2025

FVOS for MOSE Track of 4th PVUW Challenge: 3rd Place Solution

arXiv:2504.09507v1 · h-index: 11
Originality: Synthesis-oriented
AI Analysis

This work addresses accurate video object segmentation in challenging scenes. Its contribution is incremental: it builds on existing methods and optimizes them rather than proposing a new architecture.

The paper improves Video Object Segmentation (VOS) in complex real-world scenarios by fine-tuning existing methods and adding morphological post-processing and voting-based fusion. It achieves J&F scores of 76.81% on validation and 83.92% on testing, securing third place in the MOSE Track of the 4th PVUW Challenge 2025.

Video Object Segmentation (VOS) is one of the most fundamental and challenging tasks in computer vision and has a wide range of applications. Most existing methods rely on spatiotemporal memory networks to extract frame-level features and have achieved promising results on commonly used datasets. However, these methods often struggle in more complex real-world scenarios. This paper addresses this issue, aiming to achieve accurate segmentation of video objects in challenging scenes. We propose fine-tuning VOS (FVOS), optimizing existing methods for specific datasets through tailored training. Additionally, we introduce a morphological post-processing strategy to address the issue of excessively large gaps between adjacent objects in single-model predictions. Finally, we apply a voting-based fusion method on multi-scale segmentation results to generate the final output. Our approach achieves J&F scores of 76.81% and 83.92% during the validation and testing stages, respectively, securing third place overall in the MOSE Track of the 4th PVUW challenge 2025.
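The abstract names two post-hoc steps, voting-based fusion of multi-scale results and morphological post-processing, without giving their exact form. The following is a minimal sketch of one plausible reading, not the authors' implementation. It assumes that the multi-scale predictions have already been resized to a common resolution, that fusion is a pixel-wise majority vote over object-label maps, and that the morphological step is a dilation that grows each object into neighbouring background to narrow gaps between adjacent objects. The function names `vote_fuse` and `dilate_objects` are hypothetical.

```python
from collections import Counter


def vote_fuse(label_maps):
    """Pixel-wise majority vote over same-sized label maps (0 = background).

    Assumption: each map is a list of rows of integer object IDs, one map
    per inference scale, all already resized to the same resolution.
    """
    height, width = len(label_maps[0]), len(label_maps[0][0])
    fused = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            votes = Counter(m[y][x] for m in label_maps)
            fused[y][x] = votes.most_common(1)[0][0]
    return fused


def dilate_objects(label_map, iterations=1):
    """Grow each object into adjacent background pixels (4-connectivity).

    Assumption: dilation stands in for the paper's unspecified
    morphological operation. Existing object pixels are never overwritten.
    """
    height, width = len(label_map), len(label_map[0])
    current = [row[:] for row in label_map]
    for _ in range(iterations):
        grown = [row[:] for row in current]
        for y in range(height):
            for x in range(width):
                if current[y][x] != 0:
                    continue  # only background pixels can be claimed
                neighbours = [
                    current[ny][nx]
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                    if 0 <= ny < height and 0 <= nx < width
                    and current[ny][nx] != 0
                ]
                if neighbours:
                    # Assign the most frequent neighbouring object ID.
                    grown[y][x] = Counter(neighbours).most_common(1)[0][0]
        current = grown
    return current
```

Using an odd number of scales avoids most ties in the vote. In this sketch, fusing three single-row predictions and then applying one dilation pass closes a two-pixel gap between objects 1 and 2, with each object claiming the background pixel next to it.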

