IV AI CV LG · Sep 23, 2024

Towards Ground-truth-free Evaluation of Any Segmentation in Medical Images

Ahjol Senbi, Tianyu Huang, Fei Lyu, Qing Li, Yuhui Tao, Wei Shao, Qiang Chen, Chengyan Wang, Shuo Wang, Tao Zhou, Yizhe Zhang

arXiv:2409.14874v2 · 10.33 citations · h-index: 53 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses the need for automated quality assessment of medical image segmentations, particularly those produced by models such as SAM. The contribution is incremental, however, because it builds on prior regression frameworks and existing datasets.

The paper tackles the problem of evaluating segmentation quality in medical images without ground truth. It proposes EvanySeg, a model that estimates segmentation quality scores from the coherence between an image and its segmentation, and finds that transformer-based architectures such as ViT outperform convolution-based ones on this task.

We explore the feasibility and potential of building a ground-truth-free evaluation model to assess the quality of segmentations generated by the Segment Anything Model (SAM) and its variants in medical imaging. This evaluation model estimates segmentation quality scores by analyzing the coherence and consistency between the input images and their corresponding segmentation predictions. Following prior work, we frame the task of training this model as a regression problem within a supervised learning framework, using Dice scores (and optionally other metrics) as targets and mean squared error as the training loss. The model is trained on a large collection of public medical imaging datasets paired with segmentation predictions from SAM and its variants. We name this model EvanySeg (Evaluation of Any Segmentation in Medical Images). Our exploration of convolution-based models (e.g., ResNet) and transformer-based models (e.g., ViT) suggests that ViT performs better on this task. EvanySeg can be employed for various tasks, including: (1) identifying poorly segmented samples by detecting low-percentile segmentation quality scores; (2) benchmarking segmentation models without ground truth by averaging quality scores across test samples; (3) alerting human experts to poor-quality segmentation predictions during human-AI collaboration by applying a threshold within the score space; and (4) selecting the best segmentation prediction for each test sample at test time when multiple segmentation models are available, by choosing the prediction with the highest quality score. Models and code will be made available at https://github.com/ahjolsenbics/EvanySeg.
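The training setup described in the abstract treats quality estimation as regression: each (image, segmentation) pair is labeled with the Dice score it achieves against ground truth, and the evaluator is fit with mean squared error. The following is a minimal NumPy sketch of the label and loss computation, assuming binary masks. The regressor itself (a ViT in the paper) is replaced by placeholder scores, and the helper names `dice_score` and `mse_loss` are hypothetical, not taken from the EvanySeg code.

```python
import numpy as np


def dice_score(pred: np.ndarray, gt: np.ndarray) -> float:
    """Dice coefficient 2|P ∩ G| / (|P| + |G|) between two binary masks."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    total = int(pred.sum()) + int(gt.sum())
    if total == 0:
        # Both masks empty: treated here as perfect agreement (a common convention).
        return 1.0
    intersection = int(np.logical_and(pred, gt).sum())
    return 2.0 * intersection / total


def mse_loss(predicted: np.ndarray, target: np.ndarray) -> float:
    """Mean squared error between predicted quality scores and Dice targets."""
    predicted = np.asarray(predicted, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.mean((predicted - target) ** 2))


if __name__ == "__main__":
    # Toy ground truth: a 2x2 square in a 4x4 image.
    gt = np.zeros((4, 4), dtype=np.uint8)
    gt[0:2, 0:2] = 1

    # Two candidate segmentations, standing in for SAM-style predictions.
    under_seg = np.zeros_like(gt)
    under_seg[0:2, 0:1] = 1  # covers half of the object
    perfect = gt.copy()

    # Regression targets: ground truth is needed only to build these labels.
    targets = np.array([dice_score(under_seg, gt), dice_score(perfect, gt)])

    # Placeholder outputs of the hypothetical quality-score regressor.
    predicted = np.array([0.5, 0.9])

    print("Dice targets:", targets)
    print("MSE loss:", mse_loss(predicted, targets))
```

At inference time only the trained regressor is used, so no ground truth is required; Dice appears solely as the supervision signal.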
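Once an evaluator produces a per-sample quality score, the four uses listed in the abstract reduce to simple operations on those scores. The sketch below assumes the scores are already available as NumPy arrays; the function names, the percentile, and the threshold are illustrative choices, not values or APIs from the paper.

```python
import numpy as np


def flag_low_percentile(scores: np.ndarray, percentile: float = 10.0) -> np.ndarray:
    """Use (1): indices of samples scoring at or below the given percentile."""
    scores = np.asarray(scores, dtype=float)
    cutoff = np.percentile(scores, percentile)
    return np.flatnonzero(scores <= cutoff)


def benchmark_models(scores_by_model: dict) -> dict:
    """Use (2): summarize each model by its mean predicted quality score."""
    return {name: float(np.mean(s)) for name, s in scores_by_model.items()}


def alert_low_quality(scores: np.ndarray, threshold: float) -> np.ndarray:
    """Use (3): boolean mask of predictions to route to a human expert."""
    return np.asarray(scores, dtype=float) < threshold


def select_best_prediction(score_matrix: np.ndarray) -> np.ndarray:
    """Use (4): for a (n_models, n_samples) matrix, best model index per sample."""
    return np.argmax(np.asarray(score_matrix, dtype=float), axis=0)


if __name__ == "__main__":
    # Hypothetical quality scores for five test samples from one model.
    scores = np.array([0.9, 0.2, 0.75, 0.6, 0.95])
    print("Low-percentile samples:", flag_low_percentile(scores, percentile=20))
    print("Needs review:", alert_low_quality(scores, threshold=0.65))

    # Hypothetical scores for two models on the same two samples.
    per_model = {"A": np.array([0.8, 0.6]), "B": np.array([0.7, 0.9])}
    print("Mean scores:", benchmark_models(per_model))

    # Rows are models, columns are samples.
    matrix = np.stack([per_model["A"], per_model["B"]])
    print("Best model per sample:", select_best_prediction(matrix))
```

Because every use operates only on predicted scores, the quality of each downstream decision depends directly on how well those scores track the true Dice.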

