CV · May 7

iTRIALSPACE: Programmable Virtual Lesion Trials for Controlled Evaluation of Lung CT Models

Fakrul Islam Tushar, Umme Hafsa Momy, Joseph Y. Lo, Geoffrey D. Rubin

arXiv:2605.05761 · 44.7 · h-index: 10

Predicted impact: top 75% in CV · last 90 days
Originality: Incremental advance

AI Analysis

For medical imaging researchers, iTRIALSPACE provides a controlled, auditable evaluation infrastructure that reveals model failures hidden by static retrospective benchmarks.

iTRIALSPACE is a programmable evaluation framework for lung CT models that composes real clinical CTs and lesion profiles into controlled virtual lesion trials. It exposes shortcut-driven size prediction collapse under lobe-equalized sampling and host-to-donor variance ratios of 8.9x and 3.3x, with synthetic performance rankings transferring strongly to real clinical data (ρ=0.93).

We introduce iTRIALSPACE, a programmable evaluation framework for controlled assessment of lung CT models. Standard benchmarks are static retrospective collections that entangle lesion size, lobe prevalence, anatomy, and acquisition context, making it difficult to determine what structurally drives model accuracy. iTRIALSPACE addresses this limitation by composing real clinical CTs and lesion profiles into controlled virtual lesion trials through a four-stage pipeline: multi-dataset nodule profiling, explicit trial specification, anatomy-aware mask insertion, and ControlNet-conditioned CT synthesis. The framework is built on a unified 54-attribute nodule-profile dataset spanning 13,140 annotated nodules from seven public CT sources and instantiated as 13 trial modes. We evaluate iTRIALSPACE in a 55,469-sample Virtual Lesion Study spanning three medical VLMs, four spatial-guidance conditions, and three clinical tasks. Across all 13 modes, the synthetic substrate remains within the real-to-real FID baseline, and synthetic performance rankings transfer strongly to real clinical data ($\rho = 0.93$, $p < 10^{-15}$). Controlled trial modes expose findings unavailable to fixed-distribution benchmarks, including shortcut-driven size prediction collapse under lobe-equalized sampling and host-to-donor variance ratios of 8.9x and 3.3x in twin-cross analysis. These results position iTRIALSPACE as an auditable evaluation infrastructure for controlled, falsifiable testing beyond static retrospective benchmarks.
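The two headline statistics in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names, the toy scores, and the host-by-donor score grid are all hypothetical, and the variance ratio shown here (variance of per-host means over variance of per-donor means) is only one plausible reading of "host-to-donor variance ratio", since the abstract does not define the estimator.

```python
from statistics import mean, pvariance


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg_rank
        i = j + 1
    return out


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


def host_to_donor_variance_ratio(scores):
    """scores[h][d] is a metric for donor lesion d inserted into host CT h.

    Assumed definition (not from the paper): variance of the per-host means
    divided by variance of the per-donor means.
    """
    host_means = [mean(row) for row in scores]
    donor_means = [mean(col) for col in zip(*scores)]
    return pvariance(host_means) / pvariance(donor_means)


# Hypothetical per-configuration scores on synthetic and real data.
synthetic_scores = [0.61, 0.72, 0.55, 0.80, 0.67]
real_scores = [0.58, 0.70, 0.60, 0.77, 0.63]
print(spearman_rho(synthetic_scores, real_scores))

# Hypothetical 2-host x 2-donor grid of model scores.
toy_grid = [[0.2, 0.3], [0.6, 0.7]]
print(host_to_donor_variance_ratio(toy_grid))
```

A ratio well above 1 under this assumed definition would mean the host anatomy explains more of the score spread than the inserted lesion does, which is the kind of effect the twin-cross analysis is reported to expose.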
