Understanding Cross-Model Perceptual Invariances Through Ensemble Metamers
This work addresses the challenge of improving the explainability of AI models and their alignment with human vision, though its exploration of architectural biases is incremental.
The paper tackles the problem of understanding perceptual invariances in artificial neural networks by generating metamers from ensembles of diverse architectures, and finds that convolutional neural networks produce more recognizable and human-like metamers than vision transformers.
Understanding the perceptual invariances of artificial neural networks is essential for improving explainability and aligning models with human vision. Metamers, stimuli that are physically distinct yet produce identical neural activations, serve as a valuable tool for investigating these invariances. We introduce a novel approach to metamer generation that leverages ensembles of artificial neural networks to capture representational subspaces shared across diverse architectures, including convolutional neural networks and vision transformers. To characterize the properties of the generated metamers, we employ a suite of image-based metrics that assess factors such as semantic fidelity and naturalness. Our findings show that convolutional neural networks generate more recognizable and human-like metamers, while vision transformers produce realistic but less transferable metamers, highlighting the impact of architectural biases on representational invariances.
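The ensemble-based metamer generation described above can be illustrated with a toy sketch. This is not the paper's implementation: the abstract does not specify the loss, the layers matched, or the optimizer, so everything here is an assumption made for illustration. Each "model" is a hypothetical random linear map followed by tanh, the "image" is a 16-dimensional vector, and the metamer is found by plain gradient descent on the mean squared activation mismatch averaged over the ensemble.

```python
import math
import random

random.seed(0)

DIM = 16       # flattened toy "image" size (assumption, stands in for pixels)
N_FEATS = 3    # activations per toy model (assumption)
N_MODELS = 2   # ensemble size (assumption)


def make_model():
    """Hypothetical stand-in for a network: a random weight matrix."""
    scale = 1.0 / math.sqrt(DIM)
    return [[random.gauss(0.0, scale) for _ in range(DIM)]
            for _ in range(N_FEATS)]


def activations(weights, x):
    """Toy activations: tanh(W x)."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)))
            for row in weights]


def ensemble_loss(models, targets, x):
    """Squared activation mismatch, averaged over ensemble members."""
    total = 0.0
    for weights, target in zip(models, targets):
        acts = activations(weights, x)
        total += sum((a - t) ** 2 for a, t in zip(acts, target))
    return total / len(models)


def ensemble_grad(models, targets, x):
    """Analytic gradient of ensemble_loss with respect to x."""
    grad = [0.0] * len(x)
    for weights, target in zip(models, targets):
        acts = activations(weights, x)
        for row, a, t in zip(weights, acts, target):
            # d/dx (tanh(w.x) - t)^2 = 2 (a - t) (1 - a^2) w
            coeff = 2.0 * (a - t) * (1.0 - a * a) / len(models)
            for i, w in enumerate(row):
                grad[i] += coeff * w
    return grad


def generate_metamer(models, reference, steps=2000, lr=0.05):
    """Start from noise and descend toward the reference's activations."""
    targets = [activations(m, reference) for m in models]
    x = [random.random() for _ in reference]
    initial_loss = ensemble_loss(models, targets, x)
    for _ in range(steps):
        g = ensemble_grad(models, targets, x)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x, initial_loss, ensemble_loss(models, targets, x)


models = [make_model() for _ in range(N_MODELS)]
reference = [random.random() for _ in range(DIM)]
metamer, initial_loss, final_loss = generate_metamer(models, reference)

# Distance in input space: a metamer should stay physically distinct.
pixel_distance = math.sqrt(
    sum((m - r) ** 2 for m, r in zip(metamer, reference)))

print("activation loss before:", initial_loss)
print("activation loss after: ", final_loss)
print("input-space distance to reference:", pixel_distance)
```

Because the ensemble imposes only 6 activation constraints on a 16-dimensional input, many distinct inputs share the same activations, which is what makes a metamer possible in this toy setting. Adding models to the ensemble shrinks that shared solution set, which is one way to read the abstract's "shared representational subspaces".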