CV | LG | Nov 15, 2023

ConvNet vs Transformer, Supervised vs CLIP: Beyond ImageNet Accuracy

arXiv:2311.09215v3 | 28 citations | h-index: 30 | Has Code
Originality: Synthesis-oriented
AI Analysis

This work helps practitioners choose models for specialized tasks by highlighting performance differences that traditional metrics do not capture, though its contribution is incremental.

The paper tackles model selection in computer vision by comparing ConvNet and Vision Transformer architectures under supervised and CLIP training paradigms. It finds that models with similar ImageNet accuracies differ in aspects such as mistake types, calibration, and transferability.

Modern computer vision offers a great variety of models to practitioners, and selecting a model from multiple options for specific applications can be challenging. Conventionally, competing model architectures and training protocols are compared by their classification accuracy on ImageNet. However, this single metric does not fully capture performance nuances critical for specialized tasks. In this work, we conduct an in-depth comparative analysis of model behaviors beyond ImageNet accuracy, for both ConvNet and Vision Transformer architectures, each across supervised and CLIP training paradigms. Although our selected models have similar ImageNet accuracies and compute requirements, we find that they differ in many other aspects: types of mistakes, output calibration, transferability, and feature invariance, among others. This diversity in model characteristics, not captured by traditional metrics, highlights the need for more nuanced analysis when choosing among different models. Our code is available at https://github.com/kirill-vish/Beyond-INet.
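To make the point about output calibration concrete, the sketch below computes expected calibration error (ECE) with equal-width confidence bins. ECE is one common calibration measure; using it here is an assumption, since the abstract does not name the metric, and the function name, bin count, and toy inputs are all hypothetical rather than taken from the authors' repository.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Equal-width-bin ECE: sample-weighted mean of |accuracy - confidence| per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")

    # Per-bin accumulators: sample count, summed confidence, number correct.
    counts = [0] * n_bins
    conf_sums = [0.0] * n_bins
    hits = [0] * n_bins

    for conf, ok in zip(confidences, correct):
        # Map a confidence in [0, 1] to a bin index; 1.0 falls in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        counts[idx] += 1
        conf_sums[idx] += conf
        hits[idx] += int(ok)

    total = len(confidences)
    ece = 0.0
    for n, conf_sum, hit in zip(counts, conf_sums, hits):
        if n == 0:
            continue  # empty bins contribute nothing
        gap = abs(hit / n - conf_sum / n)
        ece += (n / total) * gap
    return ece


# Toy data (hypothetical): two models with the same accuracy (3 of 4 correct)
# but different top-class confidences.
correct = [True, True, True, False]
overconfident = [0.99, 0.99, 0.99, 0.99]
calibrated = [0.75, 0.75, 0.75, 0.75]

ece_over = expected_calibration_error(overconfident, correct)
ece_cal = expected_calibration_error(calibrated, correct)
print(ece_over > ece_cal)
```

Both toy models have identical accuracy, yet the overconfident one has a larger ECE, which is the kind of difference a single accuracy number hides.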

Code Implementations: 1 repo
