Vision CNNs trained to estimate spatial latents learned similar ventral-stream-aligned representations
This challenges the assumption that the ventral stream is optimized solely for object categorization and suggests that multiple training objectives can yield similarly brain-aligned models. The result matters to neuroscientists and AI researchers who study visual processing.
The study investigated whether the primate ventral visual stream might be optimized for estimating spatial latents such as object position and pose, rather than only for object categorization. To test this, the authors trained CNNs on synthetic images rendered by a 3D graphics engine to estimate different combinations of spatial and category latents. Models trained on just a few spatial latents achieved neural alignment scores comparable to those of models trained on hundreds of categories, and spatial latent performance correlated strongly with neural alignment.
Studies of the functional role of the primate ventral visual stream have traditionally focused on object categorization, often ignoring -- despite much prior evidence -- its role in estimating "spatial" latents such as object position and pose. Most leading ventral stream models are derived by optimizing networks for object categorization, which seems to imply that the ventral stream is also derived under such an objective. Here, we explore an alternative hypothesis: Might the ventral stream be optimized for estimating spatial latents? And a closely related question: How different -- if at all -- are representations learned from spatial latent estimation compared to categorization? To address these questions, we leveraged synthetic image datasets generated by a 3D graphics engine and trained convolutional neural networks (CNNs) to estimate different combinations of spatial and category latents. We found that models trained to estimate just a few spatial latents achieve neural alignment scores comparable to those trained on hundreds of categories, and the spatial latent performance of models strongly correlates with their neural alignment. Spatial latent and category-trained models have very similar -- but not identical -- internal representations, especially in their early and middle layers. We provide evidence that this convergence is partly driven by non-target latent variability in the training data, which facilitates the implicit learning of representations of those non-target latents. Taken together, these results suggest that many training objectives, such as spatial latents, can lead to similar models aligned neurally with the ventral stream. Thus, one should not assume that the ventral stream is optimized for object categorization only. As a field, we need to continue to sharpen our measures for comparing models to brains to better understand the functional roles of the ventral stream.
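To illustrate how the similarity between the internal representations of two models could be quantified, here is a minimal sketch using linear centered kernel alignment (CKA). The abstract does not say which similarity measure was used, so the choice of linear CKA, the function name `linear_cka`, and the random arrays standing in for layer activations are all assumptions made for illustration, not the authors' method.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two activation matrices.

    x: array of shape (n_images, n_features_x)
    y: array of shape (n_images, n_features_y)
    Rows must correspond to the same images in both matrices.
    """
    # Center each feature (column) across images.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)

    # Squared Frobenius norm of the cross-covariance, normalized by
    # the Frobenius norms of the two self-covariances.
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return cross / (norm_x * norm_y)


# Hypothetical stand-ins for layer activations (NOT data from the study).
rng = np.random.default_rng(0)
acts_a = rng.normal(size=(200, 64))

# A rotated and rescaled copy of acts_a: same representation geometry.
rotation, _ = np.linalg.qr(rng.normal(size=(64, 64)))
acts_rotated = 3.0 * acts_a @ rotation

# Independent activations: unrelated representation.
acts_unrelated = rng.normal(size=(200, 64))

same = linear_cka(acts_a, acts_rotated)
different = linear_cka(acts_a, acts_unrelated)
print(f"CKA with rotated copy: {same:.3f}")
print(f"CKA with unrelated activations: {different:.3f}")
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling of the features, so the rotated copy scores 1 up to floating-point error, while the unrelated activations score noticeably lower. In practice, `acts_a` and the comparison matrix would be replaced by activations from matching layers of, say, a spatial-latent-trained and a category-trained network evaluated on the same images.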