High-Level Perceptual Similarity is Enabled by Learning Diverse Tasks
This work addresses the challenge of predicting human perceptual similarity in machine vision. It is incremental: it combines representations from existing methods rather than introducing a new paradigm.
The paper tackles the problem of predicting human perceptual similarity by hypothesizing that it emerges as a byproduct of learning diverse visual tasks. It reports results that significantly surpass recent baselines on the Totally-Looks-Like benchmark, closing much of the reported gap.
Predicting human perceptual similarity is a challenging subject of ongoing research. The visual process underlying this aspect of human vision is thought to employ multiple levels of visual analysis (shapes, objects, texture, layout, color, etc.). In this paper, we postulate that the perception of image similarity is not an explicitly learned capability, but rather one that emerges as a byproduct of learning other capabilities. We support this claim by leveraging representations learned from a diverse set of visual tasks and using them jointly to predict perceptual similarity. This is done via simple feature concatenation, without any further learning. Nevertheless, results on the challenging Totally-Looks-Like (TLL) benchmark significantly surpass recent baselines, closing much of the reported gap towards prediction of human perceptual similarity. We provide an analysis of these results and discuss them in the broader context of emergent visual capabilities and their implications for the course of machine-vision research.
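The concatenation step described above can be illustrated with a minimal sketch. It is not the authors' implementation, and several parts are assumptions not stated in the abstract: that each task's features are L2-normalized before concatenation, that similarity is measured by cosine similarity, and that evaluation is top-1 retrieval over paired "left" and "right" images. The function names (`concat_task_features`, `similarity_matrix`, `top1_accuracy`) are hypothetical, and the per-task feature arrays are assumed to come from already-trained networks.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row of x to unit L2 norm (eps guards against zero rows)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def concat_task_features(per_task_features):
    """Concatenate features from several tasks, with no further learning.

    per_task_features: list of arrays, one per task, each of shape
    (n_images, d_task). Normalizing each task's block first is an
    assumption, made so that no single task dominates by scale alone.
    """
    return np.concatenate([l2_normalize(f) for f in per_task_features], axis=1)


def similarity_matrix(left_tasks, right_tasks):
    """Cosine similarity between every left image and every right image."""
    left = l2_normalize(concat_task_features(left_tasks))
    right = l2_normalize(concat_task_features(right_tasks))
    return left @ right.T


def top1_accuracy(sim):
    """Fraction of left images whose most similar right image is their pair.

    Assumes row i of sim is paired with column i.
    """
    predicted = np.argmax(sim, axis=1)
    return float(np.mean(predicted == np.arange(sim.shape[0])))
```

In this sketch, adding a task means appending one more feature array to each list; nothing is trained or tuned on the similarity data, which mirrors the "without any further learning" setting.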