CV · Aug 13, 2019

Frame-to-Frame Aggregation of Active Regions in Web Videos for Weakly Supervised Semantic Segmentation

arXiv:1908.04501v1 · 45 citations
AI Analysis

This work addresses limited object coverage in weakly supervised semantic segmentation. It offers computer vision researchers a simple method that outperforms existing approaches under the same level of supervision, and even some that rely on extra annotations.

The paper tackles the problem of weakly supervised semantic segmentation using only image-level labels by leveraging web videos to aggregate activated regions across frames, achieving mIoU scores of 65.0 and 67.4 on PASCAL VOC 2012 with VGG-16 and ResNet 101 backbones, respectively.

When a deep neural network is trained on data with only image-level labeling, the regions activated in each image tend to identify only a small region of the target object. We propose a method of using videos automatically harvested from the web to identify a larger region of the target object by using temporal information, which is not present in the static image. The temporal variations in a video allow different regions of the target object to be activated. We obtain an activated region in each frame of a video, and then aggregate the regions from successive frames into a single image, using a warping technique based on optical flow. The resulting localization maps cover more of the target object, and can then be used as proxy ground-truth to train a segmentation network. This simple approach outperforms existing methods under the same level of supervision, and even approaches relying on extra annotations. Based on VGG-16 and ResNet 101 backbones, our method achieves the mIoU of 65.0 and 67.4, respectively, on PASCAL VOC 2012 test images, which represents a new state-of-the-art.
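
The aggregation step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that per-frame activation maps and dense optical-flow fields are already available as NumPy arrays, that warping uses nearest-neighbour backward sampling, and that warped maps are fused with an element-wise maximum. The abstract does not specify the fusion operator, and the function names `warp_with_flow` and `aggregate_activations` are hypothetical.

```python
import numpy as np


def warp_with_flow(act, flow):
    """Backward-warp an activation map (H, W) into the reference frame.

    Assumption: flow[y, x] = (dx, dy) gives the displacement from reference
    pixel (x, y) to its corresponding location in the source frame.
    """
    h, w = act.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Nearest-neighbour lookup of the source location for every reference pixel.
    src_x = np.rint(xs + flow[..., 0]).astype(int)
    src_y = np.rint(ys + flow[..., 1]).astype(int)
    # Pixels whose source location falls outside the frame stay at zero.
    valid = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    warped = np.zeros_like(act)
    warped[valid] = act[src_y[valid], src_x[valid]]
    return warped


def aggregate_activations(ref_act, neighbor_acts, flows):
    """Fuse the reference activation with flow-warped neighbouring activations.

    Assumption: element-wise maximum is used as the fusion rule.
    """
    agg = ref_act.copy()
    for act, flow in zip(neighbor_acts, flows):
        agg = np.maximum(agg, warp_with_flow(act, flow))
    return agg


# Toy example: the neighbouring frame activates a part of the object that the
# reference frame misses; a uniform flow of +1 pixel in x aligns the two.
ref = np.zeros((4, 4))
ref[1, 1] = 1.0
neighbor = np.zeros((4, 4))
neighbor[1, 3] = 0.8
flow = np.zeros((4, 4, 2))
flow[..., 0] = 1.0
aggregated = aggregate_activations(ref, [neighbor], [flow])
```

In the toy example the aggregated map is non-zero at two pixels instead of one, which mirrors the claim that combining successive frames covers more of the target object than any single frame does. In practice the flow fields would come from an optical-flow estimator and the aggregated map would be thresholded to form proxy ground truth.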

