CV · Mar 29

Can Unsupervised Segmentation Reduce Annotation Costs for Video Semantic Segmentation?

arXiv:2603.27697 · 32.31 citations · h-index: 29
Predicted impact top 84% in CV · last 90 days · Originality: Synthesis-oriented
AI Analysis

For practitioners of video semantic segmentation, this work offers a practical method to reduce annotation costs while maintaining performance, though the approach is incremental.

The paper investigates using unsupervised segmentation models (SAM, SAM 2) to reduce annotation costs for video semantic segmentation, finding that annotation needs can be reduced by a third with similar performance, and that frame variety is more important than quantity.

Present-day deep neural networks for video semantic segmentation require a large number of fine-grained pixel-level annotations to achieve the best possible results. Obtaining such annotations, however, is very expensive. On the other hand, raw, unannotated video frames are practically free to obtain. Similarly, coarse annotations, which do not require precise boundaries, are also much cheaper. This paper investigates approaches to reduce the annotation cost required for video segmentation datasets by utilising such resources. We show that using state-of-the-art segmentation foundation models, Segment Anything Model (SAM) and Segment Anything Model 2 (SAM 2), we can utilise both unannotated frames as well as coarse annotations to alleviate the effort required for manual annotation of video segmentation datasets by automating mask generation. Our investigation suggests that if used appropriately, we can reduce the need for annotation by a third with similar performance for video semantic segmentation. More significantly, our analysis suggests that the variety of frames in the dataset is more important than the number of frames for obtaining the best performance.
