Aisha Urooj

h-index: 5

2 papers

84 citations

2 Papers

9.6 CV Jul 9

Metrics or Mirage? An Audit of Evaluation Inconsistencies in Colonoscopy Polyp Segmentation Benchmarks

Aisha Urooj, Zain Ul Abdien, Neelu Madan

Progress in colonoscopy polyp segmentation is routinely reported through leaderboard comparisons on a small set of public benchmarks. We argue that this apparent progress is difficult to verify: a systematic audit of 27 papers published between 2015 and 2026 reveals three structural problems in how the community evaluates models. First, 25 of 27 papers omit the Hausdorff distance. Hausdorff distance is a boundary-accuracy metric with direct clinical relevance for detecting flat or small polyps, and is a standard in radiotherapy segmentation. Second, at least five incompatible train/test split protocols co-exist across papers reporting results on the same two datasets (Kvasir-SEG and CVC-ClinicDB), making published Dice scores non-comparable even when they appear in the same leaderboard column. Third, 26 of 27 papers make performance claims without any statistical significance test. Strikingly, four papers published after the Metrics Reloaded framework (Maier-Hein et al., Nature Methods 2024) perpetuate these same problems, suggesting that general-purpose metric guidance has not yet reached the colonoscopy sub-community. To show these problems are not merely cosmetic, we re-evaluate five representative models under three controlled protocols with a single uniform scorer, and find that the reported metric conceals large boundary and recall failures, that the "best" model changes with the metric, and that near-tied rankings reverse across random splits. We propose a five-point Polyp Segmentation Reporting Checklist (PSRC) as a lightweight, domain-adapted corrective.
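The claim that the "best" model can change with the metric can be illustrated with a toy sketch. This is not the paper's scorer: the masks below are invented, masks are represented as sets of (row, column) pixel coordinates, and the helper names `dice`, `hausdorff`, and `rect` are hypothetical. It only shows how an overlap metric (Dice) and a boundary metric (Hausdorff distance) can rank the same two predictions in opposite order.

```python
import math

def dice(pred, truth):
    """Dice coefficient between two sets of foreground pixel coordinates."""
    if not pred and not truth:
        return 1.0  # convention for two empty masks (an assumption of this sketch)
    return 2 * len(pred & truth) / (len(pred) + len(truth))

def hausdorff(pred, truth):
    """Symmetric Hausdorff distance in pixels between two non-empty point sets."""
    def directed(a, b):
        # Largest distance from any point in a to its nearest point in b.
        return max(min(math.dist(p, q) for q in b) for p in a)
    return max(directed(pred, truth), directed(truth, pred))

def rect(r0, r1, c0, c1):
    """Filled rectangle covering rows r0..r1-1 and columns c0..c1-1."""
    return {(r, c) for r in range(r0, r1) for c in range(c0, c1)}

# Invented ground truth: a 20x20 square "polyp".
truth = rect(10, 30, 10, 30)

# Prediction A: exact overlap plus one stray false-positive pixel far away.
pred_a = truth | {(10, 60)}

# Prediction B: the same square shifted by one pixel in each direction.
pred_b = rect(11, 31, 11, 31)

for name, pred in [("A", pred_a), ("B", pred_b)]:
    print(name, round(dice(pred, truth), 4), round(hausdorff(pred, truth), 3))
```

In this toy case Dice prefers prediction A, while the Hausdorff distance prefers prediction B, because a single distant false-positive pixel barely changes the overlap but dominates the worst-case boundary error. The brute-force distance computation is quadratic in the number of pixels and is meant only for small illustrative masks.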

19.7 CV Jun 9, 2025

CXR-LT 2024: A MICCAI challenge on long-tailed, multi-label, and zero-shot disease classification from chest X-ray

Mingquan Lin, Gregory Holste, Song Wang et al.

The CXR-LT series is a community-driven initiative designed to enhance lung disease classification using chest X-rays (CXRs). It tackles challenges in open long-tailed lung disease classification and enhances the measurability of state-of-the-art techniques. The first event, CXR-LT 2023, aimed to achieve these goals by providing high-quality benchmark CXR data for model development and conducting comprehensive evaluations to identify ongoing issues impacting lung disease classification performance. Building on the success of CXR-LT 2023, CXR-LT 2024 expands the dataset to 377,110 CXRs and 45 disease labels, including 19 new rare disease findings. It also introduces a new focus on zero-shot learning to address limitations identified in the previous event. Specifically, CXR-LT 2024 features three tasks: (i) long-tailed classification on a large, noisy test set, (ii) long-tailed classification on a manually annotated "gold standard" subset, and (iii) zero-shot generalization to five previously unseen disease findings. This paper provides an overview of CXR-LT 2024, detailing the data curation process and consolidating state-of-the-art solutions, including the use of multimodal models for rare disease detection, advanced generative approaches to handle noisy labels, and zero-shot learning strategies for unseen diseases. Additionally, the expanded dataset enhances disease coverage to better represent real-world clinical settings, offering a valuable resource for future research. By synthesizing the insights and innovations of participating teams, we aim to advance the development of clinically realistic and generalizable diagnostic models for chest radiography.