CV · Aug 20, 2024

The Instance-centric Transformer for the RVOS Track of LSVOS Challenge: 3rd Place Solution

arXiv:2408.10541v1 · h-index: 11
Originality: Synthesis-oriented
AI Analysis

This work addresses the multi-modal task of segmenting objects in videos based on natural language expressions. It is incremental: it builds on existing methods such as DETR and SAM and targets a specific challenge.

The paper tackles Referring Video Object Segmentation by building two instance-centric models and fusing their predictions. It scores 52.67 J&F in the validation phase and 60.36 J&F in the test phase, securing 3rd place in the LSVOS Challenge RVOS Track.

Referring Video Object Segmentation is an emerging multi-modal task that aims to segment objects in a video given a natural language expression. In this work, we build two instance-centric models and fuse their predicted results at the frame level and the instance level. First, we introduce instance masks into the DETR-based model for query initialization to achieve temporal enhancement, and employ SAM for spatial refinement. Second, we build an instance retrieval model that performs binary classification of instance masks, deciding whether each instance is the referred one. Finally, we fuse the predicted results. Our method achieves a score of 52.67 J&F in the validation phase and 60.36 J&F in the test phase, securing 3rd place in the final ranking of the 6th LSVOS Challenge RVOS Track.
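The scores above are reported in J&F, which in video object segmentation benchmarks is the average of region similarity J (the intersection over union of predicted and ground-truth masks) and contour accuracy F. The following is a minimal sketch of that arithmetic, not the challenge's official evaluation code. The function names and the toy masks are hypothetical, the F values are taken as given inputs because contour matching is not shown, and the official evaluation may aggregate over objects and videos differently.

```python
from typing import Sequence

Mask = Sequence[Sequence[int]]  # binary mask: rows of 0/1 values


def jaccard(pred: Mask, gt: Mask) -> float:
    """Region similarity J: intersection over union of two binary masks."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def j_and_f(j_scores: Sequence[float], f_scores: Sequence[float]) -> float:
    """J&F: the average of mean J and mean F over the evaluated frames."""
    mean_j = sum(j_scores) / len(j_scores)
    mean_f = sum(f_scores) / len(f_scores)
    return (mean_j + mean_f) / 2


# Toy example: a 2x3 predicted mask against a 2x3 ground-truth mask.
pred = [[1, 1, 0],
        [0, 1, 0]]
gt = [[1, 0, 0],
      [0, 1, 1]]
j = jaccard(pred, gt)  # intersection of 2 pixels, union of 4 pixels
```

In this sketch a J&F of 52.67 would correspond to the mean of J and F, each expressed as a percentage, coming out at 52.67.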

