CV IV | Dec 17, 2025

The LUMirage: An independent evaluation of zero-shot performance in the LUMIR challenge

Rohit Jena, Pratik Chaudhari, James C. Gee

arXiv:2512.15505v1 | 1 citation | h-index: 5

Originality: Synthesis-oriented

AI Analysis

This work critically assesses a benchmark for neuroimaging registration and highlights limitations of deep learning methods in clinical workflows. It is incremental in that it builds on the existing domain shift literature.

The paper re-evaluates claims of zero-shot generalization in deformable image registration. It finds that deep learning methods perform well on in-distribution data, degrade significantly on out-of-distribution contrasts (Cohen's d of 0.7 to 1.5), and fail to run on high-resolution images.

The LUMIR challenge represents an important benchmark for evaluating deformable image registration methods on large-scale neuroimaging data. While the challenge demonstrates that modern deep learning methods achieve competitive accuracy on T1-weighted MRI, it also claims exceptional zero-shot generalization to unseen contrasts and resolutions, assertions that contradict established understanding of domain shift in deep learning. In this paper, we perform an independent re-evaluation of these zero-shot claims using rigorous evaluation protocols while addressing potential sources of instrumentation bias. Our findings reveal a more nuanced picture: (1) deep learning methods perform comparably to iterative optimization on in-distribution T1w images and even on human-adjacent species (macaque), demonstrating improved task understanding; (2) however, performance degrades significantly on out-of-distribution contrasts (T2, T2*, FLAIR), with Cohen's d scores ranging from 0.7 to 1.5, indicating substantial practical impact on downstream clinical workflows; (3) deep learning methods face scalability limitations on high-resolution data, failing to run on 0.6 mm isotropic images, while iterative methods benefit from increased resolution; and (4) deep methods exhibit high sensitivity to preprocessing choices. These results align with the well-established literature on domain shift and suggest that claims of universal zero-shot superiority require careful scrutiny. We advocate for evaluation protocols that reflect practical clinical and research workflows rather than conditions that may inadvertently favor particular method classes.
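
The abstract reports effect sizes as Cohen's d. As a rough illustration of how such a value can be computed, here is a minimal sketch that assumes two independent samples and the pooled-standard-deviation form of Cohen's d; the abstract does not say which variant the authors use. The function name `cohens_d` and the Dice scores are hypothetical, invented only for this example, and are not taken from the paper.

```python
import math
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled standard deviation.

    Assumption: the pooled-SD variant; the paper may use a different one.
    """
    na, nb = len(a), len(b)
    # statistics.variance is the sample variance (n - 1 in the denominator).
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


# Hypothetical per-subject Dice scores, made up for illustration only.
iterative_dice = [0.78, 0.80, 0.76, 0.79, 0.77]
deep_dice = [0.75, 0.78, 0.74, 0.77, 0.76]

d = cohens_d(iterative_dice, deep_dice)
print(f"Cohen's d = {d:.2f}")
```

By the common rule of thumb, d near 0.8 is considered a large effect, so the reported range of 0.7 to 1.5 corresponds to a mean gap of roughly 0.7 to 1.5 pooled standard deviations between the compared methods.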
