Automated mapping of virtual environments with visual predictive coding
This work provides a unified algorithmic framework for cognitive mapping that could generalize to auditory, tactile, and linguistic inputs, addressing a foundational challenge in AI and neuroscience.
The paper addresses the problem of constructing cognitive maps from sensory inputs. It introduces a predictive coding framework in which a self-attention-equipped convolutional neural network learns spatial maps from visual data in virtual environments, yielding an internal representation that quantitatively reflects distances and allows the agent to pinpoint its location.
Humans construct internal cognitive maps of their environment directly from sensory inputs, without access to a system of explicit coordinates or distance measurements. While machine learning algorithms like SLAM use specialized visual inference procedures to identify visual features and construct spatial maps from visual and odometry data, the general nature of cognitive maps in the brain suggests a unified algorithmic mapping strategy that can generalize to auditory, tactile, and linguistic inputs. Here, we demonstrate that predictive coding provides a natural and versatile neural network algorithm for constructing spatial maps from sensory data. We introduce a framework in which an agent navigates a virtual environment while engaging in visual predictive coding using a self-attention-equipped convolutional neural network. While learning a next-image prediction task, the agent automatically constructs an internal representation of the environment that quantitatively reflects distances. The internal map enables the agent to pinpoint its location relative to landmarks using only visual information. The predictive coding network generates a vectorized encoding of the environment that supports vector navigation, in which individual latent space units delineate localized, overlapping neighborhoods in the environment. Broadly, our work introduces predictive coding as a unified algorithmic framework for constructing cognitive maps that can naturally extend to the mapping of auditory, sensorimotor, and linguistic inputs.
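The claim that the internal representation "quantitatively reflects distances" can be made concrete with a small sketch. The code below is not the procedure used in this work. It is a minimal illustration, assuming we already have one latent vector per visited location, of two checks one might run: a Pearson correlation between pairwise latent distances and pairwise physical distances, and a nearest-neighbor lookup that recovers a position from a latent vector. The function names (`latent_distance_correlation`, `localize`) and the toy data are hypothetical, and the latents here are hand-made stand-ins rather than outputs of a trained network.

```python
import math
from itertools import combinations


def pairwise_distances(points):
    """Euclidean distance for every unordered pair of points."""
    return [math.dist(a, b) for a, b in combinations(points, 2)]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


def latent_distance_correlation(latents, positions):
    """Correlate latent-space distances with physical distances."""
    return pearson(pairwise_distances(latents), pairwise_distances(positions))


def localize(query_latent, latents, positions):
    """Return the known position whose latent vector is closest to the query."""
    best = min(range(len(latents)),
               key=lambda i: math.dist(query_latent, latents[i]))
    return positions[best]


# Toy data (assumption): latents are a scaled copy of the true positions,
# padded with one extra dimension, standing in for a learned encoding.
positions = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 1.0)]
latents = [(2.0 * x, 2.0 * y, 0.0) for x, y in positions]

score = latent_distance_correlation(latents, positions)
estimate = localize((1.9, 0.1, 0.0), latents, positions)
```

Because the toy latents are an exact rescaling of the positions, the correlation is 1 up to floating-point error. For a learned encoding the correlation would be a measured quantity, and a high value would indicate that distances in latent space track distances in the environment.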