Traces of Image Memorability in Vision Encoders: Activations, Attention Distributions and Autoencoder Losses
This work addresses image memorability prediction, with applications in computer vision and cognitive science. The contribution is incremental: it builds on existing methods and reports modest improvements.
The paper investigates whether features from pretrained vision encoders, such as latent activations and attention distributions, correlate with human image memorability. It finds that these features are informative predictors, and that sparse autoencoder loss on vision transformer representations outperforms previous methods based on CNN representations.
Images vary in how memorable they are to humans. Inspired by findings from cognitive science and computer vision, this paper explores the correlates of image memorability in pretrained vision encoders, focusing on latent activations, attention distributions, and the uniformity of image patches. We find that these features correlate with memorability to some extent. We also explore sparse autoencoder loss over the representations of vision transformers as a proxy for memorability, which outperforms past methods that use convolutional neural network representations. Our results shed light on the relationship between model-internal features and memorability, and show that some of these features are informative predictors of what makes images memorable to humans.
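To make the idea of using sparse autoencoder loss as a memorability proxy more concrete, the following is a minimal sketch and not the paper's implementation. It assumes a single-hidden-layer autoencoder with a ReLU code, a loss made of mean squared reconstruction error plus an L1 sparsity penalty, and Spearman rank correlation as the way to compare per-image loss with memorability scores. The function names, the `l1_coeff` parameter, and the random features, weights, and scores are hypothetical placeholders. In practice the features would come from a vision transformer and the autoencoder would be trained first.

```python
import numpy as np


def sae_loss(x, W_enc, b_enc, W_dec, b_dec, l1_coeff=1e-3):
    """Per-image loss: mean squared reconstruction error plus an L1 penalty.

    x has shape (n_images, dim). The architecture and loss form are assumptions.
    """
    z = np.maximum(x @ W_enc + b_enc, 0.0)           # ReLU code, (n_images, hidden)
    x_hat = z @ W_dec + b_dec                         # reconstruction, (n_images, dim)
    recon = np.mean((x - x_hat) ** 2, axis=1)         # one value per image
    sparsity = l1_coeff * np.sum(np.abs(z), axis=1)   # L1 norm of the code
    return recon + sparsity


def rank(a):
    """Zero-based ranks; ties are not averaged, so this assumes distinct values."""
    order = np.argsort(a)
    ranks = np.empty(len(a), dtype=float)
    ranks[order] = np.arange(len(a))
    return ranks


def spearman(a, b):
    """Spearman rank correlation, computed as the Pearson correlation of ranks."""
    return float(np.corrcoef(rank(a), rank(b))[0, 1])


# Placeholder data: random stand-ins for encoder features and memorability scores.
rng = np.random.default_rng(0)
n_images, dim, hidden = 8, 16, 32
features = rng.normal(size=(n_images, dim))
W_enc = rng.normal(scale=0.1, size=(dim, hidden))
W_dec = rng.normal(scale=0.1, size=(hidden, dim))
b_enc, b_dec = np.zeros(hidden), np.zeros(dim)

losses = sae_loss(features, W_enc, b_enc, W_dec, b_dec)
memorability = rng.uniform(size=n_images)
print("Spearman correlation:", spearman(losses, memorability))
```

With random inputs the printed correlation carries no meaning. The sketch only illustrates how a per-image loss could be compared against human memorability scores.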