ELoPE: Fine-Grained Visual Classification with Efficient Localization, Pooling and Embedding
This work addresses the problem of distinguishing visually similar categories, for applications such as species or model identification, and represents an incremental improvement.
The paper tackles fine-grained visual classification by enhancing a backbone CNN with three efficient components, achieving new state-of-the-art recognition accuracies on the Stanford Cars and FGVC-Aircraft datasets.
The task of fine-grained visual classification (FGVC) deals with classification problems that display a small inter-class variance, such as distinguishing between different bird species or car models. State-of-the-art approaches typically tackle this problem by integrating an elaborate attention mechanism or (part-) localization method into a standard convolutional neural network (CNN). This work likewise aims to enhance the performance of a backbone CNN such as ResNet, but does so by including three efficient and lightweight components specifically designed for FGVC: global k-max pooling, a discriminative embedding layer trained by optimizing class means, and an efficient bounding box estimator that only needs class labels for training. The resulting model achieves new state-of-the-art recognition accuracies on the Stanford Cars and FGVC-Aircraft datasets.
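To make the first of the three components more concrete, the following is a minimal sketch of global k-max pooling. It assumes the common reading of the term: for each channel of the final feature map, the k largest spatial activations are averaged. The abstract does not define the operation, so this reading, the function name global_k_max_pool and the nested-list input layout are illustrative assumptions rather than the paper's implementation.

```python
import heapq


def global_k_max_pool(feature_map, k):
    """Average the k largest activations of each channel.

    feature_map: list of channels, each a 2D list (height x width) of floats.
    This layout is an assumption made for illustration only.
    Returns one pooled value per channel.
    """
    pooled = []
    for channel in feature_map:
        # Flatten the spatial grid of this channel into a single list.
        flat = [value for row in channel for value in row]
        if not 1 <= k <= len(flat):
            raise ValueError("k must be between 1 and the number of spatial positions")
        # Keep the k strongest responses and average them.
        top_k = heapq.nlargest(k, flat)
        pooled.append(sum(top_k) / k)
    return pooled


# Two channels of size 2 x 2.
features = [
    [[1.0, 5.0], [3.0, 2.0]],
    [[0.0, 0.0], [4.0, 8.0]],
]
print(global_k_max_pool(features, k=2))
```

Under this reading, k = 1 reduces to global max pooling and k equal to the number of spatial positions reduces to global average pooling, so k interpolates between the two.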