Beyond Accuracy: Uncovering the Role of Similarity Perception and its Alignment with Semantics in Supervised Learning
This work addresses the limited attention paid to how similarity perception emerges in deep vision networks, a topic relevant to researchers in computer vision and machine learning, although the contribution is incremental in nature.
The paper studies how deep vision networks develop similarity perception and how well that perception aligns with semantic similarity, and it introduces the Deep Similarity Inspector (DSI) framework for this purpose. The results show that both CNNs and ViTs develop rich similarity perception in three phases, with clear differences between the two architectures, and they reveal phenomena such as gradual mistake elimination and mistake refinement.
Similarity manifests in various forms. Semantic similarity is particularly important: it serves as an approximation of human object categorization based on, e.g., shared functionalities and evolutionary traits. It also offers practical advantages in computational modeling, because lexical structures such as WordNet provide constant and interpretable similarity. In deep vision, however, the emergence of similarity perception has still not received enough attention. We introduce the Deep Similarity Inspector (DSI), a systematic framework to inspect how deep vision networks develop their similarity perception and how it aligns with semantic similarity. Our experiments show that both Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) develop a rich similarity perception during training in three phases (initial similarity surge, refinement, stabilization), with clear differences between CNNs and ViTs. Besides gradual mistake elimination, a mistake refinement phenomenon can also be observed.
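To make the notion of alignment between a network's similarity perception and semantic similarity concrete, the following is a minimal sketch and not the DSI procedure itself, which the abstract does not specify. It assumes that network similarity is the cosine similarity between per-class feature vectors, that semantic similarity is a WordNet-style path similarity, and that alignment is the Spearman rank correlation between the two over all class pairs. The toy hierarchy `PARENT`, the helper names, and the feature vectors are all hypothetical stand-ins.

```python
import math
from itertools import combinations

# Toy is-a hierarchy standing in for WordNet (hypothetical, for illustration only).
PARENT = {
    "dog": "mammal", "cat": "mammal", "mammal": "animal",
    "sparrow": "bird", "bird": "animal", "animal": "entity",
    "car": "vehicle", "vehicle": "entity",
}

def ancestors(node):
    """Return the path from node up to the root, node included."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def path_similarity(a, b):
    """1 / (1 + number of edges on the shortest is-a path between a and b)."""
    path_a, path_b = ancestors(a), ancestors(b)
    depth_in_b = {n: i for i, n in enumerate(path_b)}
    for i, n in enumerate(path_a):
        if n in depth_in_b:  # first common ancestor
            return 1.0 / (1 + i + depth_in_b[n])
    return 0.0

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

def alignment(class_features):
    """Rank correlation between network and semantic similarity over class pairs."""
    pairs = list(combinations(sorted(class_features), 2))
    net = [cosine(class_features[a], class_features[b]) for a, b in pairs]
    sem = [path_similarity(a, b) for a, b in pairs]
    return spearman(net, sem)

# Hypothetical per-class features, e.g. averaged penultimate-layer activations.
features = {"dog": [1.0, 0.9, 0.0], "cat": [0.9, 1.0, 0.0], "car": [0.0, 0.1, 1.0]}
score = alignment(features)
```

Evaluating such a score on checkpoints saved throughout training would give one curve per model, which is one way the phases named above could be made visible. With a real label set, the toy hierarchy would be replaced by an actual WordNet lookup.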