Few-Shot Referring Video Single- and Multi-Object Segmentation via Cross-Modal Affinity with Instance Sequence Matching
The work addresses language-guided video object segmentation, with applications such as video editing and autonomous systems, but it appears incremental: it builds on existing referring video object segmentation (RVOS) methods and adds targeted enhancements.
The paper tackles segmenting objects in videos from natural language descriptions. It proposes two models, FS-RVOS for single objects and FS-RVMOS for multiple objects, which are reported to outperform state-of-the-art methods on benchmarks in both robustness and accuracy.
Referring video object segmentation (RVOS) aims to segment objects in videos guided by natural language descriptions. We propose FS-RVOS, a Transformer-based model with two key components: a cross-modal affinity module and an instance sequence matching strategy. The latter extends FS-RVOS to multi-object segmentation (FS-RVMOS). Experiments show that FS-RVOS and FS-RVMOS outperform state-of-the-art methods across diverse benchmarks, demonstrating superior robustness and accuracy.
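The abstract names an instance sequence matching strategy without describing it, so the following is only a minimal sketch of the general idea of matching candidate mask sequences to referred objects, not the authors' procedure. It assumes masks are represented as sets of pixel coordinates, that similarity is mean per-frame IoU, and that the number of candidates is small enough for a brute-force search. The names `sequence_iou` and `match_sequences` are hypothetical.

```python
from itertools import permutations


def sequence_iou(pred_seq, ref_seq):
    """Mean per-frame IoU between two non-empty mask sequences.

    Each sequence is a list of frames; each frame is a set of (row, col) pixels.
    This similarity measure is an assumption made for illustration.
    """
    scores = []
    for pred, ref in zip(pred_seq, ref_seq):
        union = pred | ref
        # Two empty masks are treated as a perfect match.
        scores.append(len(pred & ref) / len(union) if union else 1.0)
    return sum(scores) / len(scores)


def match_sequences(pred_seqs, ref_seqs):
    """Brute-force one-to-one assignment maximizing total sequence IoU.

    Returns (assignment, score), where assignment[i] is the index of the
    predicted sequence matched to reference object i.
    """
    best_assignment, best_score = None, -1.0
    for perm in permutations(range(len(pred_seqs)), len(ref_seqs)):
        score = sum(
            sequence_iou(pred_seqs[p], ref_seqs[r]) for r, p in enumerate(perm)
        )
        if score > best_score:
            best_assignment, best_score = perm, score
    return best_assignment, best_score


# Toy example: two referred objects over two frames, three candidate sequences.
ref_a = [{(0, 0), (0, 1)}, {(0, 1), (0, 2)}]
ref_b = [{(2, 2)}, {(2, 3)}]
pred_0 = [{(2, 2)}, {(2, 3), (2, 4)}]   # overlaps ref_b
pred_1 = [{(0, 0), (0, 1)}, {(0, 1)}]   # overlaps ref_a
pred_2 = [set(), set()]                 # empty candidate

assignment, score = match_sequences([pred_0, pred_1, pred_2], [ref_a, ref_b])
print(assignment, score)
```

Matching whole sequences rather than individual frames keeps each object's identity consistent across the video. For more than a handful of candidates, a polynomial-time assignment solver would replace the brute-force loop.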