CV · ML · May 30

A Systematic Benchmark of Intraoperative Ultrasound-to-MR Synthesis for Brain Tumour Surgery

Olga Esteban-Sinovas, Santiago Cepeda, Ignacio Arrese, Rosario Sarabia

arXiv:2606.00630 · 21.3 · h-index: 18

Predicted impact: top 90% in CV · last 90 days · Originality: Synthesis-oriented

AI Analysis

For researchers and clinicians using intraoperative ultrasound in brain tumour surgery, this benchmark provides evidence that perceptual metrics and downstream-task performance should guide model selection alongside, or in preference to, global image-similarity metrics such as SSIM.

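This paper benchmarks six architectures for synthesising MRI from intraoperative ultrasound across 48 experiments on the ReMIND dataset; to the authors' knowledge, it is the first benchmark to span architectural paradigms, inference regimes and downstream-task endpoints under a common protocol. No single architecture dominates. Perceptual quality tracks downstream segmentation utility most closely (LPIPS, r=-0.66; LPIPS is a distance, so the negative sign means better perceptual quality accompanies higher utility), whereas higher SSIM is associated with worse utility (r=-0.64). SynDiff-2.5D best preserves downstream segmentation (U_Dice=0.55).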

Intraoperative ultrasound (ioUS) is a versatile, cost-effective modality in brain tumour surgery, but its interpretation is difficult: acquisition planes are non-standard, artefacts are modality-specific, and its appearance differs markedly from the preoperative MRI on which surgical-planning tools, segmentation models and the surgeon's experience rely. Synthesising MRI-like images from ioUS could let this MRI-based infrastructure be reused intraoperatively without an extra scan. Most prior work evaluates a single architecture in isolation; to our knowledge, no benchmark has spanned architectural paradigms, inference regimes and downstream-task endpoints under a common protocol. We address this gap on the public ReMIND data set (76 patients; 153 paired ioUS/T2w and 104 paired ioUS/FLAIR studies; 60/16 patient-level train/held-out split). Six generators (four GAN baselines: Pix2Pix, SwinPix2Pix, CycleGAN, CUT; the transformer-augmented ResViT; and the few-step diffusion model SynDiff) were each trained under four inference regimes (2D, 2.5D, 2D + 3D-refinement, full-3D) and two targets (T2w only; T2w + FLAIR multi-task), yielding 48 experiments. Image-fidelity metrics (SSIM, PSNR, MAE, LPIPS) were complemented by an nnU-Net v2 downstream segmentation evaluation (tumour and resection cavity) and by subgroup analyses by histological grade and reoperation. No architecture dominated every axis, and, critically, perceptual quality tracked downstream utility most closely (LPIPS, r=-0.66, p<0.001), whereas higher SSIM was associated with worse utility (r=-0.64, p<0.001); SynDiff-2.5D best preserved downstream segmentation (U_Dice=0.55). Perceptual and downstream-task metrics should therefore be reported alongside or in preference to global SSIM, and architecture choice conditioned on surgical phase, patient history and clinical objective.
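The count of 48 experiments follows from the full factorial design described in the abstract: 6 generators × 4 inference regimes × 2 targets. A minimal Python sketch of that grid is shown here; the list names, dictionary keys and regime/target labels are illustrative assumptions, not identifiers from the authors' code.

```python
from itertools import product

# Labels taken from the abstract; the variable names and the
# dictionary layout are hypothetical, chosen for illustration only.
ARCHITECTURES = ["Pix2Pix", "SwinPix2Pix", "CycleGAN", "CUT", "ResViT", "SynDiff"]
REGIMES = ["2D", "2.5D", "2D+3D-refinement", "full-3D"]
TARGETS = ["T2w", "T2w+FLAIR"]

# Full factorial design: every architecture under every regime and target.
experiments = [
    {"architecture": arch, "regime": regime, "target": target}
    for arch, regime, target in product(ARCHITECTURES, REGIMES, TARGETS)
]

print(len(experiments))  # 6 * 4 * 2 = 48
```

Each entry corresponds to one trained configuration; for example, the SynDiff-2.5D result quoted in the abstract corresponds to the entries with architecture "SynDiff" and regime "2.5D".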
