Learning a Weight Map for Weakly-Supervised Localization
This work addresses the challenge of localizing objects without detailed annotations, which is useful for applications such as fine-grained categorization. The contribution is incremental, however, since it builds on existing weakly-supervised techniques.
The paper tackles weakly-supervised localization using only image-level labels. It trains a generative network to produce a per-pixel weight map that indicates object locations, and it reports outperforming existing methods by a sizable margin on fine-grained classification and generic image recognition datasets.
In the weakly supervised localization setting, supervision is given as an image-level label. We propose to employ an image classifier $f$ and to train a generative network $g$ that outputs, given the input image, a per-pixel weight map that indicates the location of the object within the image. Network $g$ is trained by minimizing the discrepancy between the output of the classifier $f$ on the original image and its output given the same image weighted by the output of $g$. The scheme requires a regularization term that ensures that $g$ does not provide a uniform weight, and an early stopping criterion that prevents $g$ from over-segmenting the image. Our results indicate that the method outperforms existing localization methods by a sizable margin on challenging fine-grained classification datasets, as well as on a generic image recognition dataset. Additionally, the obtained weight map is state-of-the-art in weakly supervised segmentation on fine-grained categorization datasets.
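The training objective described above can be illustrated with a minimal sketch. It rests on several assumptions, since the abstract does not specify these details: the classifier $f$ is replaced by a toy linear layer with softmax, the discrepancy is taken to be a squared L2 distance between class probabilities, and the regularizer is taken to be the mean of the weight map scaled by a hypothetical coefficient `lam`. The names `classify` and `weight_map_loss` are illustrative only. The sketch evaluates the loss for fixed weight maps rather than training a network $g$.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(pixels, weights):
    """Toy stand-in for the classifier f: linear layer plus softmax (assumption)."""
    logits = [sum(w * p for w, p in zip(row, pixels)) for row in weights]
    return softmax(logits)


def weight_map_loss(pixels, weight_map, weights, lam=0.1):
    """Discrepancy between f(x) and f(m * x), plus an assumed area penalty on m."""
    masked = [m * p for m, p in zip(weight_map, pixels)]
    p_full = classify(pixels, weights)
    p_masked = classify(masked, weights)
    # Assumed discrepancy: squared L2 distance between class probabilities.
    discrepancy = sum((a - b) ** 2 for a, b in zip(p_full, p_masked))
    # Assumed regularizer: mean map value, which penalizes the all-ones map.
    area = sum(weight_map) / len(weight_map)
    return discrepancy + lam * area


# A four-pixel "image" whose first two pixels carry the object evidence.
pixels = [1.0, 2.0, 0.0, 0.0]
weights = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]

loss_full = weight_map_loss(pixels, [1.0, 1.0, 1.0, 1.0], weights)
loss_tight = weight_map_loss(pixels, [1.0, 1.0, 0.0, 0.0], weights)
loss_empty = weight_map_loss(pixels, [0.0, 0.0, 0.0, 0.0], weights)
```

In this toy setting the all-ones map preserves the classifier output but pays the full area penalty, the all-zeros map avoids the penalty but changes the classifier output, and the map covering only the informative pixels attains the lowest loss of the three. This is the trade-off that the regularization term is meant to enforce, under the assumptions stated above.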