Guided Zoom: Questioning Network Evidence for Fine-grained Classification
This work addresses the need for more reliable and interpretable predictions in fine-grained visual recognition, though its methodological contribution is incremental.
The paper tackles fine-grained classification by checking that a model's predictions rest on coherent evidence, and it reports state-of-the-art accuracy on three benchmark datasets.
We propose Guided Zoom, an approach that utilizes spatial grounding of a model's decision to make more informed predictions. It does so by making sure the model has "the right reasons" for a prediction, defined as reasons that are coherent with those used to make similar correct decisions at training time. The reason/evidence upon which a deep convolutional neural network makes a prediction is defined to be the spatial grounding, in the pixel space, for a specific class conditional probability in the model output. Guided Zoom examines how reasonable such evidence is for each of the top-k predicted classes, rather than solely trusting the top-1 prediction. We show that Guided Zoom improves the classification accuracy of a deep convolutional neural network model and obtains state-of-the-art results on three fine-grained classification benchmark datasets.
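The idea of questioning each of the top-k predictions, rather than trusting the top-1 class alone, can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `rerank_top_k`, the per-class `evidence_scores` (assumed here to be numbers in [0, 1] that say how coherent a class's spatial evidence is with training-time evidence), and the linear blend controlled by `weight` are all assumptions made for illustration. How the evidence itself is extracted and scored is not specified in the text above.

```python
def rerank_top_k(class_probs, evidence_scores, k=5, weight=0.5):
    """Pick a class among the top-k predictions using an evidence score.

    class_probs:     model output probabilities, one per class.
    evidence_scores: hypothetical coherence score per class in [0, 1]
                     (assumed to be computed elsewhere).
    k:               number of top predictions to question.
    weight:          assumed blending factor; 0 trusts the model only,
                     1 trusts the evidence score only.
    """
    if len(class_probs) != len(evidence_scores):
        raise ValueError("class_probs and evidence_scores must align")

    # Indices of the k most probable classes, most probable first.
    ranked = sorted(range(len(class_probs)),
                    key=lambda c: class_probs[c], reverse=True)
    top_k = ranked[:k]

    # Blend model confidence with evidence coherence (an assumed rule).
    def combined(c):
        return (1.0 - weight) * class_probs[c] + weight * evidence_scores[c]

    # max() keeps the earlier (more probable) class on ties.
    return max(top_k, key=combined)


if __name__ == "__main__":
    probs = [0.5, 0.4, 0.1]
    evidence = [0.1, 0.9, 0.0]
    print(rerank_top_k(probs, evidence, k=2))
```

With `weight=0` or `k=1` the function reduces to the ordinary top-1 prediction, so the evidence check can only change the outcome when a runner-up class has more coherent evidence than the top-ranked one.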