LG | AI | Oct 29, 2025

Aligning Brain Signals with Multimodal Speech and Vision Embeddings

arXiv:2511.00065v2 | 1 citation | h-index: 2
Originality: Incremental advance
AI Analysis

This work addresses how the brain processes language multimodally. The advance is incremental, since it builds on existing methods for aligning brain signals with model embeddings.

The paper investigates which layers of two pre-trained models (wav2vec2 and CLIP) best align with brain activity during speech perception, using EEG data and two evaluation methods: ridge regression and contrastive decoding. The results suggest that combining multimodal, layer-aware representations may improve alignment with brain activity.
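
To make the contrastive decoding step concrete, here is a minimal sketch of how paired predictions and targets could be scored. It is an illustration under stated assumptions, not the paper's implementation: the function name `contrastive_scores`, the CLIP-style InfoNCE loss, the cosine similarity, and the temperature value are all choices made for this example.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def contrastive_scores(pred, target, temperature=0.1):
    """Score paired rows of `pred` (e.g. embeddings predicted from EEG) and
    `target` (e.g. wav2vec2 or CLIP embeddings of the same segments).

    Returns (InfoNCE loss, top-1 retrieval accuracy). Row i of `pred` is
    assumed to correspond to row i of `target`; all other rows in the batch
    act as negatives. The temperature is a hypothetical default.
    """
    sim = l2_normalize(pred) @ l2_normalize(target).T / temperature
    sim = sim - sim.max(axis=1, keepdims=True)  # stabilise the softmax
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    idx = np.arange(len(pred))
    loss = -log_prob[idx, idx].mean()           # mean negative log-likelihood of the true pair
    top1 = (sim.argmax(axis=1) == idx).mean()   # fraction of rows whose best match is the true pair
    return float(loss), float(top1)
```

With a batch of n segments, chance-level top-1 accuracy is 1/n, so accuracy well above that would indicate that the predicted embeddings carry segment-specific information.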

When we hear the word "house", we don't just process sound, we imagine walls, doors, memories. The brain builds meaning through layers, moving from raw acoustics to rich, multimodal associations. Inspired by this, we build on recent work from Meta that aligned EEG signals with averaged wav2vec2 speech embeddings, and ask a deeper question: which layers of pre-trained models best reflect this layered processing in the brain? We compare embeddings from two models: wav2vec2, which encodes sound into language, and CLIP, which maps words to images. Using EEG recorded during natural speech perception, we evaluate how these embeddings align with brain activity using ridge regression and contrastive decoding. We test three strategies: individual layers, progressive concatenation, and progressive summation. The findings suggest that combining multimodal, layer-aware representations may bring us closer to decoding how the brain understands language, not just as sound, but as experience.
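
The three layer strategies and the ridge regression evaluation can be sketched as follows. This is a minimal illustration on synthetic data, with several assumptions: "progressive" is read as cumulatively including layers 0 through k, the regression maps EEG features to embeddings (the direction is not stated in the abstract), and the score is mean Pearson correlation on a held-out split. All function names are hypothetical.

```python
import numpy as np

def layer_features(layers, strategy, k):
    """Build a feature matrix from per-layer embeddings.

    layers: array of shape (n_layers, n_samples, dim).
    strategy: "individual" uses layer k alone; "concat" concatenates layers
    0..k along the feature axis; "sum" adds layers 0..k element-wise.
    """
    if strategy == "individual":
        return layers[k]
    if strategy == "concat":
        return np.concatenate(list(layers[: k + 1]), axis=1)
    if strategy == "sum":
        return layers[: k + 1].sum(axis=0)
    raise ValueError(f"unknown strategy: {strategy}")

def ridge_fit(X, Y, alpha=1.0):
    """Closed-form ridge weights W minimising ||XW - Y||^2 + alpha * ||W||^2.
    No intercept is fitted, so inputs are assumed to be roughly centred."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Y)

def mean_pearson(Y_true, Y_pred):
    """Pearson correlation per output dimension, averaged over dimensions."""
    a = Y_true - Y_true.mean(axis=0)
    b = Y_pred - Y_pred.mean(axis=0)
    denom = np.linalg.norm(a, axis=0) * np.linalg.norm(b, axis=0) + 1e-12
    return float(((a * b).sum(axis=0) / denom).mean())

def alignment_score(eeg, feats, alpha=1.0, train_frac=0.8):
    """Fit ridge from EEG to embeddings on the first part of the data and
    report correlation on the remaining held-out part."""
    n_train = int(len(eeg) * train_frac)
    W = ridge_fit(eeg[:n_train], feats[:n_train], alpha)
    return mean_pearson(feats[n_train:], eeg[n_train:] @ W)

# Synthetic stand-ins: 4 layers, 200 segments, 16-dim embeddings, 32 EEG features.
rng = np.random.default_rng(0)
layers = rng.standard_normal((4, 200, 16))
eeg = rng.standard_normal((200, 32))
for strategy in ("individual", "concat", "sum"):
    scores = [alignment_score(eeg, layer_features(layers, strategy, k)) for k in range(4)]
    print(strategy, np.round(scores, 3))
```

Because the synthetic EEG and embeddings here are independent noise, the printed scores only demonstrate the mechanics. With real recordings, comparing the score curves across k for each strategy is one way to ask which layers, alone or combined, track brain activity best.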
