CV | Oct 22, 2025

BrainMCLIP: Brain Image Decoding with Multi-Layer feature Fusion of CLIP

arXiv:2510.19332v1 | 1 citation | h-index: 3
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of balancing semantic accuracy and visual detail in brain image decoding for neuroscience applications. It is an incremental improvement over existing methods.

The paper tackles the problem of decoding images from fMRI by introducing BrainMCLIP, a parameter-efficient method that fuses multiple CLIP layers guided by the brain's functional hierarchy, eliminating the need for a separate VAE pipeline. Results show it achieves competitive performance, particularly on high-level semantic metrics, while reducing parameters by 71.7% compared to top VAE-based methods.

Decoding images from fMRI often involves mapping brain activity to CLIP's final semantic layer. To capture finer visual details, many approaches add a parameter-intensive VAE-based pipeline. However, these approaches overlook the rich object information within CLIP's intermediate layers and contradict the brain's functional hierarchy. We introduce BrainMCLIP, which pioneers a parameter-efficient, multi-layer fusion approach guided by the human visual system's functional hierarchy, eliminating the need for a separate VAE pathway. BrainMCLIP aligns fMRI signals from functionally distinct visual areas (low-level and high-level) to the corresponding intermediate and final CLIP layers. We further introduce a Cross-Reconstruction strategy and a novel multi-granularity loss. Results show that BrainMCLIP achieves highly competitive performance, particularly on high-level semantic metrics, where it matches or surpasses state-of-the-art (SOTA) methods, including those using VAE pipelines. Crucially, it does so with substantially fewer parameters: a reduction of 71.7% (see the paper's CLIP/VAE comparison table) relative to top VAE-based SOTA methods, achieved by avoiding the VAE pathway. By leveraging intermediate CLIP features, it captures visual details often missed by CLIP-only approaches, balancing semantic accuracy and detail fidelity.
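The hierarchy-respecting alignment described in the abstract can be illustrated with a small sketch. This is not the paper's architecture: the abstract does not specify the mappers, the fusion step, the Cross-Reconstruction strategy, or the multi-granularity loss. The sketch below assumes plain linear maps and a cosine alignment loss as stand-ins, and uses random arrays in place of real fMRI data and CLIP features. All names and sizes (`LinearMapper`, `cosine_alignment_loss`, `N_LOW_VOXELS`, and so on) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
N_SAMPLES = 8
N_LOW_VOXELS = 50   # voxels from low-level visual areas
N_HIGH_VOXELS = 40  # voxels from high-level visual areas
D_CLIP = 16         # stand-in for the CLIP feature dimension


def cosine_alignment_loss(pred, target):
    """Mean (1 - cosine similarity) between matching rows of pred and target."""
    pred_n = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    target_n = target / np.linalg.norm(target, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(pred_n * target_n, axis=1)))


class LinearMapper:
    """Assumed stand-in for a brain-to-CLIP mapper: a single linear map."""

    def __init__(self, n_in, n_out, rng):
        self.W = rng.normal(scale=n_in ** -0.5, size=(n_in, n_out))

    def __call__(self, x):
        return x @ self.W


# Random placeholders for fMRI responses and CLIP layer features.
fmri_low = rng.normal(size=(N_SAMPLES, N_LOW_VOXELS))
fmri_high = rng.normal(size=(N_SAMPLES, N_HIGH_VOXELS))
clip_mid = rng.normal(size=(N_SAMPLES, D_CLIP))    # intermediate-layer features
clip_final = rng.normal(size=(N_SAMPLES, D_CLIP))  # final-layer features

# Low-level areas are aligned to the intermediate layer,
# high-level areas to the final layer.
low_to_mid = LinearMapper(N_LOW_VOXELS, D_CLIP, rng)
high_to_final = LinearMapper(N_HIGH_VOXELS, D_CLIP, rng)

loss_mid = cosine_alignment_loss(low_to_mid(fmri_low), clip_mid)
loss_final = cosine_alignment_loss(high_to_final(fmri_high), clip_final)
total_loss = loss_mid + loss_final  # equal weighting is an assumption

print(f"mid-layer loss: {loss_mid:.3f}, final-layer loss: {loss_final:.3f}")
```

The point of the sketch is only the routing: each group of visual areas gets its own mapper and its own CLIP-layer target, so no separate VAE branch is involved. In a real setup the mappers would be trained to minimize the combined loss.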
