CV · Apr 18, 2025

Few-Shot Referring Video Single- and Multi-Object Segmentation via Cross-Modal Affinity with Instance Sequence Matching

Heng Liu, Guanghui Li, Mingqi Gao, Xiantong Zhen, Feng Zheng, Yang Wang

arXiv:2504.13710v1 · 3 citations · h-index: 6 · Has Code · Int J Comput Vis

Originality: Incremental advance

AI Analysis

This work addresses referring video object segmentation, with applications such as video editing and autonomous systems, but it appears incremental: it builds on existing RVOS methods with specific enhancements.

The paper tackles the problem of segmenting objects in videos from natural language descriptions. It proposes the FS-RVOS and FS-RVMOS models, which are reported to outperform state-of-the-art methods on benchmarks, with improved robustness and accuracy.

Referring video object segmentation (RVOS) aims to segment objects in videos guided by natural language descriptions. We propose FS-RVOS, a Transformer-based model with two key components: a cross-modal affinity module and an instance sequence matching strategy, which extends FS-RVOS to multi-object segmentation (FS-RVMOS). Experiments show FS-RVOS and FS-RVMOS outperform state-of-the-art methods across diverse benchmarks, demonstrating superior robustness and accuracy.
