CV · AI · May 31, 2025

From Local Cues to Global Percepts: Emergent Gestalt Organization in Self-Supervised Vision Models

arXiv:2506.00718v1 · 5 citations · h-index: 2
Originality: Incremental advance
AI Analysis

This work examines how training conditions affect global perceptual abilities in vision models, with implications for AI vision research. Its advance is incremental: it links model behaviors to biological principles.

The study investigates whether modern vision models exhibit Gestalt-like perceptual organization. It finds that self-supervised models such as MAE-trained Vision Transformers show activation patterns consistent with Gestalt principles and sometimes exceed human performance on tests of global spatial sensitivity.

Human vision organizes local cues into coherent global forms using Gestalt principles like closure, proximity, and figure-ground assignment -- functions reliant on global spatial structure. We investigate whether modern vision models show similar behaviors, and under what training conditions these emerge. We find that Vision Transformers (ViTs) trained with Masked Autoencoding (MAE) exhibit activation patterns consistent with Gestalt laws, including illusory contour completion, convexity preference, and dynamic figure-ground segregation. To probe the computational basis, we hypothesize that modeling global dependencies is necessary for Gestalt-like organization. We introduce the Distorted Spatial Relationship Testbench (DiSRT), which evaluates sensitivity to global spatial perturbations while preserving local textures. Using DiSRT, we show that self-supervised models (e.g., MAE, CLIP) outperform supervised baselines and sometimes even exceed human performance. ConvNeXt models trained with MAE also exhibit Gestalt-compatible representations, suggesting such sensitivity can arise without attention architectures. However, classification finetuning degrades this ability. Inspired by biological vision, we show that a Top-K activation sparsity mechanism can restore global sensitivity. Our findings identify training conditions that promote or suppress Gestalt-like perception and establish DiSRT as a diagnostic for global structure sensitivity across models.
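The Top-K activation sparsity idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say where in the network the mechanism is applied or how the top entries are selected, so everything below is an assumption made for illustration: the function name `top_k_sparsify` is hypothetical, selection is by raw activation value along the last axis, and all other entries are set to zero.

```python
import numpy as np


def top_k_sparsify(activations: np.ndarray, k: int) -> np.ndarray:
    """Keep the k largest values along the last axis and zero out the rest.

    Illustrative only: the paper's actual selection rule and placement
    in the network are not given in the abstract.
    """
    if not 1 <= k <= activations.shape[-1]:
        raise ValueError("k must be between 1 and the size of the last axis")
    # Indices of the k largest entries per row (order within the top k is arbitrary).
    top_idx = np.argpartition(activations, -k, axis=-1)[..., -k:]
    # Boolean mask that is True only at the selected positions.
    mask = np.zeros(activations.shape, dtype=bool)
    np.put_along_axis(mask, top_idx, True, axis=-1)
    return np.where(mask, activations, 0.0)


# Toy activations: two feature vectors of length four.
acts = np.array([[0.1, 2.0, -1.0, 0.7],
                 [3.0, 0.2, 0.5, -0.4]])
sparse = top_k_sparsify(acts, k=2)
print(sparse)
```

With `k=2`, each row keeps only its two largest activations, so a few strong responses are retained and the weaker ones are suppressed. Selecting by absolute value instead, or applying the operation per channel or per spatial location, would be equally plausible variants.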

