CV, Jul 13, 2021

Per-Pixel Classification is Not All You Need for Semantic Segmentation

arXiv:2107.06278v2, 2034 citations
Originality: Highly original
AI Analysis

The paper simplifies the landscape of segmentation methods in computer vision and holds up well when the number of classes is large, though it is incremental in that it builds on existing mask classification ideas.

The paper tackles semantic segmentation by proposing a unified mask classification approach that handles both semantic- and instance-level tasks, outperforming per-pixel baselines with 55.6 mIoU on ADE20K and 52.7 PQ on COCO.

Modern approaches typically formulate semantic segmentation as a per-pixel classification task, while instance-level segmentation is handled with an alternative mask classification. Our key insight: mask classification is sufficiently general to solve both semantic- and instance-level segmentation tasks in a unified manner using the exact same model, loss, and training procedure. Following this observation, we propose MaskFormer, a simple mask classification model which predicts a set of binary masks, each associated with a single global class label prediction. Overall, the proposed mask classification-based method simplifies the landscape of effective approaches to semantic and panoptic segmentation tasks and shows excellent empirical results. In particular, we observe that MaskFormer outperforms per-pixel classification baselines when the number of classes is large. Our mask classification-based method outperforms both current state-of-the-art semantic (55.6 mIoU on ADE20K) and panoptic segmentation (52.7 PQ on COCO) models.
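To make the abstract's description concrete, the sketch below shows one way a set of binary masks with one global class prediction each could be turned into a per-pixel semantic map. It assumes each of the N predictions carries class logits with a trailing "no object" entry and per-pixel mask logits, and that a pixel's score for a class is the sum over predictions of class probability times mask probability. The function name `semantic_inference` and the plain-list data layout are illustrative assumptions, not the authors' code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def semantic_inference(class_logits, mask_logits):
    """Combine N mask predictions into an H x W map of class indices.

    class_logits: N rows of K + 1 logits; the last entry is assumed to be
                  a "no object" class and is dropped before scoring.
    mask_logits:  N masks, each an H x W grid of per-pixel logits.
    """
    # Per-prediction class probabilities, without the "no object" entry.
    class_probs = [softmax(row)[:-1] for row in class_logits]
    num_preds = len(class_probs)
    num_classes = len(class_probs[0])
    height, width = len(mask_logits[0]), len(mask_logits[0][0])

    labels = []
    for y in range(height):
        row = []
        for x in range(width):
            # Score of class c at this pixel: sum_i p_i(c) * m_i[y, x].
            scores = [
                sum(
                    class_probs[i][c] * sigmoid(mask_logits[i][y][x])
                    for i in range(num_preds)
                )
                for c in range(num_classes)
            ]
            row.append(max(range(num_classes), key=scores.__getitem__))
        labels.append(row)
    return labels


# Toy input: two predictions, two classes plus "no object", a 1 x 2 image.
toy_class_logits = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]]
toy_mask_logits = [[[4.0, -4.0]], [[-4.0, 4.0]]]
toy_labels = semantic_inference(toy_class_logits, toy_mask_logits)
```

In the toy input, the first prediction is confident in class 0 and covers the left pixel, while the second is confident in class 1 and covers the right pixel. Because the class label is attached to a whole mask rather than to each pixel, the same predictions could in principle also be read out as separate segments for instance-level tasks.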

Code Implementations: 3 repos