CV LG Jan 10, 2022

Reproducing BowNet: Learning Representations by Predicting Bags of Visual Words

arXiv:2201.03556v2
AI Analysis

This paper addresses the reproducibility of a self-supervised learning technique for computer vision, but it is incremental, since it focuses on replication rather than new contributions.

This work attempted to reproduce the self-supervised learning method BowNet, which predicts bag-of-words features from perturbed images to learn robust representations for downstream tasks, but failed to achieve the reported accuracy improvements on CIFAR-100.

This work aims to reproduce results from the CVPR 2020 paper by Gidaris et al. Self-supervised learning (SSL) is used to learn feature representations of images from an unlabeled dataset. The original paper proposes using bag-of-words (BoW) deep feature descriptors as a self-supervised learning target to learn robust, deep representations. BowNet is trained to reconstruct the histogram of visual words (i.e., the deep BoW descriptor) of a reference image when presented with a perturbed version of that image as input. The method thus aims to learn perturbation-invariant and context-aware image features that are useful for few-shot or supervised downstream tasks. In the paper, the authors describe BowNet as a network consisting of a convolutional feature extractor $\Phi(\cdot)$ and a dense-softmax layer $\Omega(\cdot)$ trained to predict BoW features from images. After BoW training, the features of $\Phi$ are used in downstream tasks. For this challenge, we tried to build and train a network that could reproduce the CIFAR-100 accuracy improvements reported in the original paper. However, we were unable to reproduce an accuracy improvement comparable to the one the authors reported. Several factors could explain this; we believe time constraints were the primary bottleneck.
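The training objective described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the hard nearest-word assignment, the L1-normalised histogram, the cross-entropy loss, and all names (`bow_target`, `predict_bow`, `bow_loss`) and shapes are assumptions made for illustration, and random arrays stand in for the feature maps and for the output of the feature extractor on the perturbed image.

```python
import numpy as np


def bow_target(feature_map, vocabulary):
    """Build a BoW histogram from local descriptors (assumed hard assignment).

    feature_map: (H, W, C) array of local descriptors of the reference image.
    vocabulary:  (K, C) array of visual words (e.g. cluster centroids).
    Returns an L1-normalised histogram of shape (K,).
    """
    descriptors = feature_map.reshape(-1, feature_map.shape[-1])  # (H*W, C)
    # Squared Euclidean distance from every descriptor to every visual word.
    diff = descriptors[:, None, :] - vocabulary[None, :, :]
    sq_dist = (diff ** 2).sum(axis=-1)                             # (H*W, K)
    words = sq_dist.argmin(axis=1)                                 # nearest word index
    hist = np.bincount(words, minlength=vocabulary.shape[0]).astype(np.float64)
    return hist / hist.sum()


def softmax(logits):
    shifted = logits - logits.max()  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()


def predict_bow(features, weight, bias):
    """Dense + softmax head: maps a (D,) feature vector to K word probabilities."""
    return softmax(weight @ features + bias)


def bow_loss(predicted, target, eps=1e-12):
    """Cross-entropy between the target histogram and the predicted distribution."""
    return float(-(target * np.log(predicted + eps)).sum())


# Toy data: all sizes and values below are hypothetical stand-ins.
rng = np.random.default_rng(0)
K, C, D = 8, 4, 16
vocabulary = rng.normal(size=(K, C))
reference_map = rng.normal(size=(5, 5, C))   # features of the unperturbed image
perturbed_features = rng.normal(size=D)      # stand-in for the extractor output
weight = 0.1 * rng.normal(size=(K, D))
bias = np.zeros(K)

target = bow_target(reference_map, vocabulary)
predicted = predict_bow(perturbed_features, weight, bias)
loss = bow_loss(predicted, target)
```

In an actual training loop the loss would be backpropagated through both the prediction head and the feature extractor, while the target histogram is treated as a fixed label computed from the unperturbed image.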
