CVLGMay 9

MedFL-Stress: A Systematic Robustness Evaluation of Federated Brain Tumor Segmentation under Cross-Hospital MRI Appearance Shift

arXiv:2605.090252.5
AI Analysis

For clinical deployment of federated brain tumor segmentation, this work highlights the need for robustness-oriented evaluation to detect site-specific failures that average metrics conceal.

Current federated learning evaluations report only average performance, masking failures at individual hospitals. MedFL-Stress reveals that FedAvg's high mean Dice (0.8159) hides a 0.0850 gap between best and worst hospital, while FedBN closes that gap by 41% with minimal mean accuracy loss, improving the weakest hospital by 3.5 Dice points.

Federated learning enables hospitals to collaboratively train segmentation models without sharing patient data. However, current evaluation protocols report only average performance across clients, masking failures at individual sites. In clinical deployment, a model that fails consistently at one hospital is a real safety risk that a good mean score can hide entirely. We introduce MedFL-Stress, a controlled stress-testing framework that exposes exactly this failure mode. Using 2D axial slices from BraTS 2020 distributed across four simulated hospital clients, we apply graded MRI appearance shifts (gamma contrast, scale-shift, and noise-plus-blur) reflecting scanner and acquisition variability in real multi-site deployments. Three federated baselines are evaluated: FedAvg, FedProx, and FedBN. Worst-hospital Dice and inter-hospital disparity are treated as primary metrics, not supplementary observations. FedAvg achieves the highest global mean Dice (0.8159) but conceals a 0.0850 gap between its best and worst-performing hospital. FedBN closes that gap by 41% (0.0850 to 0.0503) while sacrificing less than half a Dice point in mean accuracy (0.8159 to 0.8109), and the weakest hospital gains 3.5 Dice points outright (0.7309 to 0.7656). These findings demonstrate that robustness-oriented evaluation protocols are essential for reliable federated medical imaging deployment.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes