CV, Dec 13, 2021

5th Place Solution for VSPW 2021 Challenge

arXiv:2112.06379v1
Originality: Synthesis-oriented
AI Analysis

This work addresses video semantic segmentation for researchers in computer vision, but it is incremental as it primarily combines existing techniques without major breakthroughs.

The authors tackled the VSPW 2021 Challenge for video semantic segmentation by building on Swin Transformer and MaskFormer baselines, using stochastic weight averaging and a hierarchical ensemble strategy, achieving 5th place on the private leaderboard without external datasets.

In this article, we introduce the solution we used in the VSPW 2021 Challenge. Our experiments are based on two baseline models, Swin Transformer and MaskFormer. To further boost performance, we adopt the stochastic weight averaging technique and design a hierarchical ensemble strategy. Without using any external semantic segmentation dataset, our solution ranked 5th on the private leaderboard. We also made some attempts to tackle long-tail recognition and overfitting, which improved results on the val subset. Possibly because of a distribution difference, these attempts did not help on the test subset. We describe these attempts as well, in the hope of inspiring other researchers.
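The two performance techniques named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract gives no details of the ensemble, so the two-level grouping below (average within each model family, then across families) is only one plausible reading of "hierarchical". The function names, the toy weights, and the toy probabilities are all hypothetical, and plain Python lists stand in for real tensors.

```python
from typing import Dict, List, Sequence

# Hypothetical stand-in for a model state dict: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def average_weights(checkpoints: Sequence[Weights]) -> Weights:
    """Equal-weight average of parameters across checkpoints (the core idea of SWA)."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    n = len(checkpoints)
    averaged: Weights = {}
    for name, first in checkpoints[0].items():
        sums = [0.0] * len(first)
        for ckpt in checkpoints:
            for i, value in enumerate(ckpt[name]):
                sums[i] += value
        averaged[name] = [s / n for s in sums]
    return averaged


def mean_probs(prob_lists: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of several class-probability vectors for one pixel."""
    n = len(prob_lists)
    return [sum(col) / n for col in zip(*prob_lists)]


def hierarchical_ensemble(groups: Dict[str, Sequence[Sequence[float]]]) -> List[float]:
    """Assumed two-level scheme: average within each group, then across group means."""
    return mean_probs([mean_probs(members) for members in groups.values()])


# Toy checkpoints saved at different points of one training run (made-up values).
ckpts = [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}]
swa = average_weights(ckpts)

# Toy per-pixel class probabilities, grouped by model family (made-up values).
groups = {
    "swin": [[0.5, 0.5], [1.0, 0.0]],
    "maskformer": [[0.5, 0.5]],
}
fused = hierarchical_ensemble(groups)
predicted_class = max(range(len(fused)), key=fused.__getitem__)
```

Averaging within a family first keeps a family with many members from dominating the final vote, which is one reason a two-level scheme might be preferred over a flat average. In a real pipeline, batch-normalization statistics are usually recomputed after weight averaging, a step this sketch omits.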
