CV AI NC | Jul 26, 2025

Predicting Brain Responses To Natural Movies With Multimodal LLMs

Cesar Kadir Torrico Villanueva, Jiaxin Cindy Tu, Mihir Tripathy, Connor Lane, Rishab Iyer, Paul S. Scotti

arXiv:2507.19956v1 | 5 citations | h-index: 5 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the challenge of generalizing brain encoding models to novel stimuli, a problem relevant to both neuroscience and AI researchers. It is an incremental advance: it builds on existing methods and relies on comprehensive tuning.

The authors predict brain responses to natural movies by combining multimodal features from pretrained models with a lightweight encoder and ensembling. The approach achieved a mean Pearson's correlation of 0.2085 on the test data and placed fourth in the Algonauts 2025 challenge.
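As a rough illustration of the reported metric, the sketch below computes Pearson's correlation between predicted and measured time series for each parcel and then averages the result. The array shapes, the synthetic data, and the choice to average over parcels only are assumptions made for illustration; the challenge's exact aggregation is not described here.

```python
import numpy as np

def parcelwise_pearson(pred, target):
    """Pearson r per parcel; both inputs have shape (timepoints, parcels)."""
    pred_c = pred - pred.mean(axis=0)        # center each parcel's time series
    target_c = target - target.mean(axis=0)
    num = (pred_c * target_c).sum(axis=0)
    den = np.sqrt((pred_c ** 2).sum(axis=0) * (target_c ** 2).sum(axis=0))
    return num / den

# Synthetic stand-ins for measured and predicted fMRI (hypothetical sizes).
rng = np.random.default_rng(0)
target = rng.standard_normal((200, 10))
pred = target + rng.standard_normal((200, 10))  # noisy "prediction"

r = parcelwise_pearson(pred, target)  # one correlation per parcel
score = float(r.mean())               # mean correlation across parcels
```

Centering each column before taking the normalized dot product makes the per-parcel value agree with `np.corrcoef` applied to the same pair of time series.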

We present MedARC's team solution to the Algonauts 2025 challenge. Our pipeline leveraged rich multimodal representations from various state-of-the-art pretrained models across video (V-JEPA2), speech (Whisper), text (Llama 3.2), vision-text (InternVL3), and vision-text-audio (Qwen2.5-Omni). These features extracted from the models were linearly projected to a latent space, temporally aligned to the fMRI time series, and finally mapped to cortical parcels through a lightweight encoder comprising a shared group head plus subject-specific residual heads. We trained hundreds of model variants across hyperparameter settings, validated them on held-out movies and assembled ensembles targeted to each parcel in each subject. Our final submission achieved a mean Pearson's correlation of 0.2085 on the test split of withheld out-of-distribution movies, placing our team in fourth place for the competition. We further discuss a last-minute optimization that would have raised us to second place. Our results highlight how combining features from models trained in different modalities, using a simple architecture consisting of shared-subject and single-subject components, and conducting comprehensive model selection and ensembling improves generalization of encoding models to novel movie stimuli. All code is available on GitHub.
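The encoder described in the abstract (per-model linear projections into a latent space, followed by a shared group head plus subject-specific residual heads) might be sketched as the forward pass below. This is a minimal NumPy sketch, not the authors' implementation: the feature names, all dimensions, the subject IDs, the summation used to fuse modalities, and the zero initialization of the residual heads are assumptions, and the features are assumed to be already temporally aligned to the fMRI time series.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes and identifiers, chosen only for illustration.
n_timepoints, n_parcels, latent_dim = 50, 8, 16
feature_dims = {"video": 32, "speech": 24, "text": 12}
subjects = ["sub-01", "sub-02"]

# One linear projection per feature source into a common latent space.
proj = {m: rng.standard_normal((d, latent_dim)) * 0.1
        for m, d in feature_dims.items()}

# Shared group head, plus one residual head per subject (zero-initialized here).
shared_head = rng.standard_normal((latent_dim, n_parcels)) * 0.1
residual_heads = {s: np.zeros((latent_dim, n_parcels)) for s in subjects}

def encode(features, subject):
    """Map aligned features {name: (timepoints, dim)} to (timepoints, parcels)."""
    # Assumption: projected features are fused by summation.
    latent = sum(features[m] @ proj[m] for m in feature_dims)
    # Group prediction plus a subject-specific correction.
    return latent @ shared_head + latent @ residual_heads[subject]

features = {m: rng.standard_normal((n_timepoints, d))
            for m, d in feature_dims.items()}
pred = encode(features, "sub-01")  # shape: (n_timepoints, n_parcels)
```

With the residual heads at zero, every subject receives the group prediction; training would then let each residual head learn only how its subject deviates from the group.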
