CV
Feb 4

Temporal Slowness in Central Vision Drives Semantic Object Learning

arXiv:2602.04462v1
h-index: 6
AI Analysis

This work targets AI researchers and cognitive scientists interested in the mechanisms of human visual learning. Its contribution is incremental: it builds on existing methods such as time-contrastive learning and gaze prediction.

The study asks how humans learn semantic object representations from visual experience. It simulates five months of human-like visual input from the Ego4D dataset, uses predicted gaze locations to extract crops that mimic central vision, and trains a time-contrastive Self-Supervised Learning model on those crops. Combining temporal slowness with central vision improved the encoding of semantic object facets: central vision strengthened foreground object features, while temporal slowness, especially during fixational eye movements, encoded broader semantic information.

Humans acquire semantic object representations from egocentric visual streams with minimal supervision. Importantly, the visual system processes with high resolution only the center of its field of view and learns similar representations for visual inputs occurring close in time. This emphasizes slowly changing information around gaze locations. This study investigates the role of central vision and slowness learning in the formation of semantic object representations from human-like visual experience. We simulate five months of human-like visual experience using the Ego4D dataset and generate gaze coordinates with a state-of-the-art gaze prediction model. Using these predictions, we extract crops that mimic central vision and train a time-contrastive Self-Supervised Learning model on them. Our results show that combining temporal slowness and central vision improves the encoding of different semantic facets of object representations. Specifically, focusing on central vision strengthens the extraction of foreground object features, while considering temporal slowness, especially during fixational eye movements, allows the model to encode broader semantic information about objects. These findings provide new insights into the mechanisms by which humans may develop semantic object representations from natural visual experience.
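The two ingredients named in the abstract, gaze-centered crops and a time-contrastive objective, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `central_crop` and `time_contrastive_loss`, the square crop window, the InfoNCE form of the loss, and the temperature value are all assumptions chosen for illustration. The crop function also assumes the window is no larger than the frame.

```python
import numpy as np

def central_crop(frame, gaze_xy, size):
    """Return a size x size window around the gaze point (hypothetical helper).

    The window is shifted where needed so it stays inside the frame;
    assumes size <= min(frame height, frame width).
    """
    h, w = frame.shape[:2]
    x, y = gaze_xy
    half = size // 2
    # Clamp the top-left corner so the crop never leaves the frame.
    left = int(min(max(x - half, 0), w - size))
    top = int(min(max(y - half, 0), h - size))
    return frame[top:top + size, left:left + size]

def time_contrastive_loss(z_t, z_next, temperature=0.1):
    """InfoNCE-style loss (an assumed form of time-contrastive learning).

    z_t[i] and z_next[i] are embeddings of temporally adjacent crops and act
    as a positive pair; all other rows in the batch act as negatives.
    """
    # L2-normalize so dot products are cosine similarities.
    z_t = z_t / np.linalg.norm(z_t, axis=1, keepdims=True)
    z_next = z_next / np.linalg.norm(z_next, axis=1, keepdims=True)
    logits = z_t @ z_next.T / temperature
    # Subtract the row maximum for a numerically stable log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positives sit on the diagonal of the similarity matrix.
    return float(-np.mean(np.diag(log_prob)))

# Usage: crop a synthetic frame at an assumed gaze location, then compare
# temporally aligned embeddings with misaligned ones.
frame = np.arange(100 * 120).reshape(100, 120)
crop = central_crop(frame, gaze_xy=(60, 50), size=32)

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
aligned = time_contrastive_loss(z, z + 0.01 * rng.normal(size=z.shape))
shuffled = time_contrastive_loss(z, np.roll(z, 1, axis=0))
```

In this sketch, embeddings of temporally adjacent crops that stay similar yield a lower loss than misaligned pairs, which is the sense in which the objective rewards slowly changing information around gaze locations.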

