Modality-Agnostic fMRI Decoding of Vision and Language
This work asks how the brain represents information presented in different modalities, with implications for neuroscience and brain-computer interfaces. Its contribution is incremental, as it builds on prior cross-modal decoding studies.
The authors decode stimuli from fMRI brain activation across two stimulus modalities (images and text) using a new large-scale dataset. They find that a single modality-agnostic decoder performs as well as or better than modality-specific decoders, and that it works effectively with either unimodal or multimodal model representations.
Previous studies have shown that it is possible to map brain activation data of subjects viewing images onto the feature representation space of not only vision models (modality-specific decoding) but also language models (cross-modal decoding). In this work, we introduce and use a new large-scale fMRI dataset (~8,500 trials per subject) of people viewing both images and text descriptions of those images. This novel dataset enables the development of modality-agnostic decoders: a single decoder that can predict which stimulus a subject is seeing, irrespective of the modality (image or text) in which the stimulus is presented. We train and evaluate such decoders to map brain signals onto stimulus representations from a wide range of publicly available vision, language, and multimodal (vision+language) models. Our findings reveal that (1) modality-agnostic decoders perform as well as (and sometimes even better than) modality-specific decoders; (2) modality-agnostic decoders mapping brain data onto representations from unimodal models perform as well as decoders relying on multimodal representations; and (3) while language and low-level visual (occipital) brain regions are best at decoding text and image stimuli, respectively, high-level visual (temporal) regions perform well on both stimulus types.
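As a rough illustration of the modality-agnostic setup described above, the following sketch trains one linear decoder on pooled image and caption trials and then scores it separately on each modality. Everything in it is an assumption made for illustration: the data are synthetic, the decoder is closed-form ridge regression, the score is top-1 identification by cosine similarity, and the names `fit_ridge`, `identification_accuracy`, `simulate_trials`, and `run_demo` are hypothetical. The abstract does not specify the regression method or the evaluation metric.

```python
import numpy as np


def fit_ridge(brain, features, alpha=1.0):
    """Closed-form ridge regression from voxels to stimulus features.

    Solves (X^T X + alpha * I) W = X^T Y for W.
    """
    n_voxels = brain.shape[1]
    gram = brain.T @ brain + alpha * np.eye(n_voxels)
    return np.linalg.solve(gram, brain.T @ features)


def identification_accuracy(predicted, candidates):
    """Fraction of trials whose prediction is most similar (cosine)
    to the features of its own stimulus among all candidates."""
    p = predicted / np.linalg.norm(predicted, axis=1, keepdims=True)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    best = np.argmax(p @ c.T, axis=1)
    return float(np.mean(best == np.arange(len(p))))


def simulate_trials(rng, n_trials, encoding, noise_sd):
    """Synthetic stand-in for fMRI data: features -> noisy voxel patterns."""
    n_dims, n_voxels = encoding.shape
    features = rng.standard_normal((n_trials, n_dims))
    noise = noise_sd * rng.standard_normal((n_trials, n_voxels))
    return features @ encoding + noise, features


def run_demo(seed=0, n_train=200, n_test=50, n_dims=16, n_voxels=50):
    rng = np.random.default_rng(seed)
    # Assumption: both modalities share most of their encoding,
    # plus a smaller modality-specific component.
    shared = rng.standard_normal((n_dims, n_voxels))
    encodings = {
        modality: shared + 0.3 * rng.standard_normal((n_dims, n_voxels))
        for modality in ("image", "caption")
    }
    train = {m: simulate_trials(rng, n_train, e, 0.5) for m, e in encodings.items()}
    test = {m: simulate_trials(rng, n_test, e, 0.5) for m, e in encodings.items()}

    # Modality-agnostic training: pool trials from both modalities
    # and fit a single set of decoder weights.
    brain_all = np.vstack([train[m][0] for m in train])
    feats_all = np.vstack([train[m][1] for m in train])
    weights = fit_ridge(brain_all, feats_all)

    # Evaluate the same decoder on each modality separately.
    return {
        m: identification_accuracy(test[m][0] @ weights, test[m][1])
        for m in test
    }


print(run_demo())
```

A modality-specific baseline would fit `fit_ridge` on the trials of one modality only; comparing its per-modality scores with those of the pooled decoder mirrors the comparison in finding (1), though results on this toy data say nothing about real fMRI.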