CV · Aug 19, 2025

Temporal-Conditional Referring Video Object Segmentation with Noise-Free Text-to-Video Diffusion Model

arXiv:2508.13584v1 · 1 citation · h-index: 4
Originality: Incremental advance
AI Analysis

This work targets segmentation accuracy in referring video object segmentation and is rated an incremental improvement.

The paper tackles the problem of referring video object segmentation by improving segmentation head design and leveraging a text-to-video diffusion model, achieving state-of-the-art performance on four public benchmarks.

Referring Video Object Segmentation (RVOS) aims to segment specific objects in a video according to textual descriptions. We observe that recent RVOS approaches often emphasize feature extraction and temporal modeling while relatively neglecting the segmentation head, whose design leaves considerable room for improvement. To address this, we propose a Temporal-Conditional Referring Video Object Segmentation model, which integrates existing segmentation methods to enhance boundary segmentation capability. Our model also leverages a text-to-video diffusion model for feature extraction. On top of this, we remove the traditional noise prediction module so that the randomness of noise does not degrade segmentation accuracy, which simplifies the model while improving performance. Finally, to overcome the limited feature extraction capability of the VAE, we design a Temporal Context Mask Refinement (TCMR) module, which significantly improves segmentation quality without introducing complex designs. We evaluate our method on four public RVOS benchmarks, where it consistently achieves state-of-the-art performance.

