Facing the Hard Problems in FGVC
This work offers error analysis and model combination for researchers in fine-grained visual categorization. It is incremental: it builds on existing methods rather than introducing a new paradigm.
The paper analyzes state-of-the-art methods in fine-grained visual categorization, finds that they struggle with certain hard images while making complementary mistakes, and demonstrates that combining complementary models can improve accuracy on the CUB-200 dataset by over 5%.
In fine-grained visual categorization (FGVC), research focuses almost exclusively on attaining state-of-the-art (SOTA) accuracy. This work carefully analyzes the performance of recent SOTA methods, both quantitatively and, more importantly, qualitatively. We show that these models universally struggle with certain "hard" images, while also making complementary mistakes. We underscore the importance of such analysis, and demonstrate that combining complementary models can improve accuracy on the popular CUB-200 dataset by over 5%. In addition to a detailed analysis and characterization of the errors made by these SOTA methods, we provide a clear set of recommended directions for future FGVC researchers.
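To make the notions of "hard" images and complementary mistakes concrete, the following is a minimal sketch on toy data. It is not the paper's procedure: the abstract does not say how the models were combined, so simple averaging of class probabilities is assumed here, and the names `error_overlap` and `average_ensemble`, along with all the numbers, are hypothetical and chosen only for illustration.

```python
import numpy as np


def error_overlap(preds_a, preds_b, labels):
    """Fraction of images missed by both models, or by only one of them."""
    wrong_a = preds_a != labels
    wrong_b = preds_b != labels
    return {
        "both_wrong": float(np.mean(wrong_a & wrong_b)),     # "hard" images
        "only_a_wrong": float(np.mean(wrong_a & ~wrong_b)),  # complementary
        "only_b_wrong": float(np.mean(~wrong_a & wrong_b)),  # complementary
    }


def average_ensemble(prob_list):
    """Assumed combination rule: average class probabilities, then argmax."""
    return np.mean(np.stack(prob_list), axis=0).argmax(axis=1)


# Toy data: 4 images, 3 classes (made-up values, not results from the paper).
labels = np.array([0, 1, 2, 1])
probs_a = np.array([
    [0.7, 0.2, 0.1],  # model A correct
    [0.6, 0.4, 0.0],  # model A wrong
    [0.1, 0.1, 0.8],  # model A correct
    [0.5, 0.3, 0.2],  # model A wrong
])
probs_b = np.array([
    [0.4, 0.5, 0.1],  # model B wrong
    [0.1, 0.9, 0.0],  # model B correct
    [0.2, 0.1, 0.7],  # model B correct
    [0.6, 0.3, 0.1],  # model B wrong (same image model A misses)
])

preds_a = probs_a.argmax(axis=1)
preds_b = probs_b.argmax(axis=1)
overlap = error_overlap(preds_a, preds_b, labels)
ens_preds = average_ensemble([probs_a, probs_b])

acc_a = float(np.mean(preds_a == labels))
acc_b = float(np.mean(preds_b == labels))
acc_ens = float(np.mean(ens_preds == labels))
print(overlap)
print(acc_a, acc_b, acc_ens)
```

In this toy setup each model alone is right on half of the images, the averaged ensemble recovers the two images that only one model misses, and the image that both models miss stays wrong. That last case is the kind of "hard" image that combining models alone does not fix.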