CV · LG · Jan 10, 2022

Reproducing BowNet: Learning Representations by Predicting Bags of Visual Words

Harry Nguyen, Stone Yun, Hisham Mohammad

arXiv:2201.03556v2 · 1.4 · Has Code

Originality: Synthesis-oriented

AI Analysis

This paper addresses the reproducibility of a self-supervised learning technique for computer vision, but its contribution is incremental, since it focuses on replication rather than on new methods.

This work attempted to reproduce the self-supervised learning method BowNet, which predicts bag-of-words features from perturbed images to learn robust representations for downstream tasks, but failed to achieve the reported accuracy improvements on CIFAR-100.

This work aims to reproduce results from the CVPR 2020 paper by Gidaris et al. Self-supervised learning (SSL) is used to learn feature representations of an image using an unlabeled dataset. The original paper proposes using bag-of-words (BoW) deep feature descriptors as a self-supervised learning target to learn robust, deep representations. BowNet is trained to reconstruct the histogram of visual words (i.e., the deep BoW descriptor) of a reference image when presented with a perturbed version of that image as input. Thus, the method aims to learn perturbation-invariant and context-aware image features that can be useful for few-shot or supervised downstream tasks. The authors describe BowNet as a network consisting of a convolutional feature extractor $\Phi(\cdot)$ and a Dense-softmax layer $\Omega(\cdot)$ trained to predict BoW features from images. After BoW training, the features of $\Phi$ are used in downstream tasks. For this challenge, we tried to build and train a network that could reproduce the CIFAR-100 accuracy improvements reported in the original paper. However, we were unable to reproduce an accuracy improvement comparable to the one the authors reported. Several factors could explain this, and we believe that time constraints were the primary bottleneck.
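The BoW prediction target described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. It assumes that local feature vectors of the reference image and a vocabulary of visual words are already available (for example, from clustering features of a pretrained network), that the histogram is L1-normalized, and that the prediction is scored with cross-entropy against a softmax output. The function names `nearest_word`, `bow_target`, `softmax`, and `bow_loss` are hypothetical, and the vocabulary and feature values are made up for illustration.

```python
import math

def nearest_word(feature, vocabulary):
    """Index of the visual word closest (squared Euclidean) to a local feature."""
    best_index, best_dist = 0, float("inf")
    for k, word in enumerate(vocabulary):
        dist = sum((f - w) ** 2 for f, w in zip(feature, word))
        if dist < best_dist:
            best_index, best_dist = k, dist
    return best_index

def bow_target(local_features, vocabulary):
    """L1-normalized histogram of visual words for one reference image.

    Assumption: plain counts with L1 normalization; other weightings are possible.
    """
    counts = [0.0] * len(vocabulary)
    for feature in local_features:
        counts[nearest_word(feature, vocabulary)] += 1.0
    total = sum(counts)
    return [c / total for c in counts]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    norm = sum(exps)
    return [e / norm for e in exps]

def bow_loss(logits, target):
    """Cross-entropy between the target histogram and softmax(logits).

    `logits` stands in for the output of the dense layer applied to the
    features of the perturbed image.
    """
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)

# Toy data (made up): 3 visual words and 4 local features, all in 2-D.
vocabulary = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
reference_features = [[0.1, 0.0], [0.9, 1.2], [1.1, 0.8], [0.0, 0.2]]

target = bow_target(reference_features, vocabulary)  # [0.5, 0.5, 0.0]
loss = bow_loss([0.0, 0.0, 0.0], target)  # uniform prediction over the 3 words
```

In a full training loop, the logits would come from the network applied to a perturbed copy of the image, while the target histogram is computed from the unperturbed reference image, so that minimizing the loss encourages perturbation-invariant features.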
