CV · Sep 15, 2022

Visual Recognition with Deep Nearest Centroids

arXiv:2209.07383v2 · 189 citations · h-index: 77
Originality: Incremental advance
AI Analysis

This work addresses the need for more interpretable and efficient deep learning models in computer vision. Its contribution is incremental: it combines a classic method, nearest centroids, with modern network architectures.

The paper tackles the problem of improving simplicity, explainability, and transferability in deep visual recognition by introducing deep nearest centroids (DNC), a nonparametric classifier based on classic nearest centroids. It shows that DNC outperforms parametric models on image classification (CIFAR-10, ImageNet) and significantly boosts pixel recognition (ADE20K, Cityscapes) with fewer parameters and enhanced transparency.

We devise deep nearest centroids (DNC), a conceptually elegant yet surprisingly effective network for large-scale visual recognition, by revisiting Nearest Centroids, one of the most classic and simple classifiers. Current deep models learn the classifier in a fully parametric manner, ignoring the latent data structure and lacking simplicity and explainability. DNC instead conducts nonparametric, case-based reasoning; it utilizes sub-centroids of training samples to describe class distributions and clearly explains the classification as the proximity of test data and the class sub-centroids in the feature space. Due to the distance-based nature, the network output dimensionality is flexible, and all the learnable parameters are only for data embedding. That means all the knowledge learnt for ImageNet classification can be completely transferred for pixel recognition learning, under the "pre-training and fine-tuning" paradigm. Apart from its nested simplicity and intuitive decision-making mechanism, DNC can even possess ad-hoc explainability when the sub-centroids are selected as actual training images that humans can view and inspect. Compared with parametric counterparts, DNC performs better on image classification (CIFAR-10, ImageNet) and greatly boosts pixel recognition (ADE20K, Cityscapes), with improved transparency and fewer learnable parameters, using various network architectures (ResNet, Swin) and segmentation models (FCN, DeepLabV3, Swin). We feel this work brings fundamental insights into related fields.
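
The decision rule described in the abstract (assign a test sample to the class of its closest sub-centroid in feature space) can be illustrated with a small sketch. This is not the authors' implementation: the function names `fit_sub_centroids` and `predict` are hypothetical, plain k-means with farthest-point initialisation stands in for whatever clustering the paper actually uses, Euclidean distance is an assumed proximity measure, and the hand-written 2-D points stand in for embeddings that a backbone network would produce.

```python
import numpy as np


def fit_sub_centroids(features, labels, k=2, iters=10):
    """Estimate k sub-centroids per class.

    Assumption: plain k-means with farthest-point initialisation, used
    here only as a simple stand-in for the paper's clustering step.
    """
    centroids, centroid_labels = [], []
    for c in np.unique(labels):
        x = features[labels == c]
        # Farthest-point initialisation: start from the first sample, then
        # repeatedly add the sample farthest from the centroids chosen so far.
        mu = [x[0]]
        for _ in range(k - 1):
            d = np.linalg.norm(x[:, None] - np.array(mu)[None], axis=-1)
            mu.append(x[d.min(axis=1).argmax()])
        mu = np.array(mu, dtype=float)
        for _ in range(iters):
            # Assign each sample to its closest sub-centroid.
            d = np.linalg.norm(x[:, None] - mu[None], axis=-1)
            assign = d.argmin(axis=1)
            # Move each sub-centroid to the mean of its members.
            for j in range(k):
                if np.any(assign == j):
                    mu[j] = x[assign == j].mean(axis=0)
        centroids.append(mu)
        centroid_labels.extend([c] * k)
    return np.concatenate(centroids), np.array(centroid_labels)


def predict(features, centroids, centroid_labels):
    """Label each sample with the class of its nearest sub-centroid."""
    d = np.linalg.norm(features[:, None] - centroids[None], axis=-1)
    return centroid_labels[d.argmin(axis=1)]


# Toy "embeddings": each class has two well-separated modes, so a single
# centroid per class would sit near the origin for both classes.
train_x = np.array([[-4.0, 0.0], [-4.2, 0.1], [4.0, 0.0], [4.1, -0.1],
                    [0.0, -4.0], [0.1, -4.1], [0.0, 4.0], [-0.1, 4.2]])
train_y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

centroids, centroid_labels = fit_sub_centroids(train_x, train_y, k=2)
queries = np.array([[-3.9, 0.2], [0.2, 3.7]])
print(predict(queries, centroids, centroid_labels))
```

Note that the classifier itself holds no learnable weights: in this sketch the only things stored are the sub-centroids and their class labels, which mirrors the abstract's point that all learnable parameters are for data embedding. Using several sub-centroids per class matters in the toy data because both class means coincide near the origin.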

Code Implementations: 1 repo