CV, LG, MM | Jan 10, 2023

Vision Transformers Are Good Mask Auto-Labelers

arXiv:2301.03992v1 | 24 citations | h-index: 58
Originality: Highly original
AI Analysis

This paper addresses the need for instance segmentation that does not depend on costly human mask annotations, and represents a strong incremental advance in box-supervised methods.

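The paper tackles instance segmentation with only box annotations by proposing Mask Auto-Labeler (MAL), a Transformer-based framework that generates high-quality mask pseudo-labels from box-cropped images. Instance segmentation models trained on MAL-generated masks retain up to 97.4% of the performance of their fully supervised counterparts, and the best model achieves 44.1% mAP on COCO instance segmentation (test-dev 2017).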

We propose Mask Auto-Labeler (MAL), a high-quality Transformer-based mask auto-labeling framework for instance segmentation using only box annotations. MAL takes box-cropped images as inputs and conditionally generates their mask pseudo-labels. We show that Vision Transformers are good mask auto-labelers. Our method significantly reduces the gap between auto-labeling and human annotation regarding mask quality. Instance segmentation models trained using the MAL-generated masks can nearly match the performance of their fully supervised counterparts, retaining up to 97.4% performance of fully supervised models. The best model achieves 44.1% mAP on COCO instance segmentation (test-dev 2017), outperforming state-of-the-art box-supervised methods by significant margins. Qualitative results indicate that masks produced by MAL are, in some cases, even better than human annotations.
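
The "retaining up to 97.4%" figure is most naturally read as a ratio: the mAP of a model trained on auto-labeled masks divided by the mAP of the same model trained on human masks. The sketch below shows that arithmetic under this assumed reading. The function name `retention` and the input values are hypothetical and chosen only for illustration; they are not numbers reported in the paper.

```python
def retention(box_supervised_map: float, fully_supervised_map: float) -> float:
    """Return the box-supervised score as a percentage of the fully supervised score."""
    if fully_supervised_map <= 0:
        raise ValueError("fully supervised mAP must be positive")
    return 100.0 * box_supervised_map / fully_supervised_map


# Hypothetical mAP values, used only to illustrate the arithmetic.
# They are NOT results from the paper.
example = retention(box_supervised_map=39.0, fully_supervised_map=40.0)
print(f"{example:.1f}% of fully supervised performance")
```

Read this way, a retention figure compares two training regimes for the same segmentation model, so it says how much accuracy is lost by replacing human masks with pseudo-labels rather than how good the pseudo-labels are in isolation.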

