CV, AI | Apr 29, 2024

Saliency Suppressed, Semantics Surfaced: Visual Transformations in Neural Networks and the Brain

arXiv:2404.18772v1 | h-index: 2
Originality: Incremental advance
AI Analysis

This work addresses the interpretability of deep learning models for researchers in AI and neuroscience; its contribution, a comparison of architectures and training objectives, is an incremental advance.

The study investigated how neural networks transform visual input into semantic understanding. It found that ResNets trained for object classification are more sensitive to saliency than ViTs, that CLIP enhances saliency suppression in ResNets, and that CLIP enhances semantic encoding in both architectures.

Deep learning algorithms lack human-interpretable accounts of how they transform raw visual input into a robust semantic understanding, which impedes comparisons between different architectures, training objectives, and the human brain. In this work, we take inspiration from neuroscience and employ representational approaches to shed light on how neural networks encode information at low (visual saliency) and high (semantic similarity) levels of abstraction. Moreover, we introduce a custom image dataset where we systematically manipulate salient and semantic information. We find that ResNets are more sensitive to saliency information than ViTs, when trained with object classification objectives. We uncover that networks suppress saliency in early layers, a process enhanced by natural language supervision (CLIP) in ResNets. CLIP also enhances semantic encoding in both architectures. Finally, we show that semantic encoding is a key factor in aligning AI with human visual perception, while saliency suppression is a non-brain-like strategy.
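The abstract refers to "representational approaches" without giving details. One common approach of this kind is representational similarity analysis (RSA), sketched below as an illustration only. It is not taken from the paper's code: the function names, the use of Pearson correlation, the binary category-based reference matrix, and the synthetic "activations" are all assumptions made for this example.

```python
import numpy as np

def rdm(features):
    """Representational dissimilarity matrix (RDM).

    features: array of shape (n_images, n_units), one row per image.
    Returns an (n_images, n_images) matrix of 1 - Pearson correlation.
    """
    return 1.0 - np.corrcoef(features)

def label_rdm(labels):
    """Hypothetical reference RDM: 0 if two images share a label, else 1."""
    labels = np.asarray(labels)
    return (labels[:, None] != labels[None, :]).astype(float)

def upper_triangle(matrix):
    """Off-diagonal upper-triangle entries, i.e. each image pair once."""
    rows, cols = np.triu_indices(matrix.shape[0], k=1)
    return matrix[rows, cols]

def rsa_score(features, reference_rdm):
    """Pearson correlation between a layer's RDM and a reference RDM.

    Assumption: Pearson is used here for simplicity; rank-based
    correlations are also common in RSA.
    """
    layer_vec = upper_triangle(rdm(features))
    ref_vec = upper_triangle(reference_rdm)
    return float(np.corrcoef(layer_vec, ref_vec)[0, 1])

# Synthetic stand-in for layer activations (NOT real network outputs):
# six images from two semantic categories, each a noisy copy of its
# category prototype.
rng = np.random.default_rng(0)
labels = [0, 0, 0, 1, 1, 1]
prototypes = rng.normal(size=(2, 50))
activations = prototypes[labels] + 0.1 * rng.normal(size=(6, 50))

semantic_rdm = label_rdm(labels)
semantic_score = rsa_score(activations, semantic_rdm)
```

Under these assumptions, a layer whose RDM correlates strongly with a semantic reference RDM would be read as encoding semantic similarity, and the same comparison against a saliency-based reference RDM would probe saliency sensitivity. Computing the score per layer gives a profile of how each kind of information changes with depth.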

Code Implementations: 1 repo