CytoCrowd: A Multi-Annotator Benchmark Dataset for Cytology Image Analysis
This dataset addresses a critical gap for medical AI researchers by providing a realistic testbed for handling expert disagreement, although the contribution is incremental in that it builds on existing dataset-creation efforts.
The authors address the lack of datasets that capture real-world expert disagreement in medical image analysis by introducing CytoCrowd, a benchmark of 446 cytology images. Each image comes with raw annotations from four pathologists and a separate gold standard, which allows evaluation of both standard computer vision tasks and annotation aggregation algorithms.
High-quality annotated datasets are crucial for advancing machine learning in medical image analysis. However, a critical gap exists: most datasets either offer a single, clean ground truth, which hides real-world expert disagreement, or they provide multiple annotations without a separate gold standard for objective evaluation. To bridge this gap, we introduce CytoCrowd, a new public benchmark for cytology analysis. The dataset features 446 high-resolution images, each with two key components: (1) raw, conflicting annotations from four independent pathologists, and (2) a separate, high-quality gold-standard ground truth established by a senior expert. This dual structure makes CytoCrowd a versatile resource. It serves as a benchmark for standard computer vision tasks, such as object detection and classification, using the ground truth. Simultaneously, it provides a realistic testbed for evaluating annotation aggregation algorithms that must resolve expert disagreements. We provide comprehensive baseline results for both tasks. Our experiments demonstrate the challenges presented by CytoCrowd and establish its value as a resource for developing the next generation of models for medical image analysis.
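To illustrate how the dual structure can be used, the following is a minimal sketch of one simple aggregation baseline, majority voting, scored against a gold standard. It is not the authors' method or the dataset's actual file format: the dictionary layout, the item keys (`obj_1`, ...), and the class labels (`"A"`, `"B"`) are hypothetical placeholders chosen only for illustration.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label; ties go to the label seen first in `labels`."""
    counts = Counter(labels)
    best = max(counts.values())
    for label in labels:
        if counts[label] == best:
            return label


def aggregate(raw):
    """Collapse each item's list of annotator labels into a single label."""
    return {item: majority_vote(votes) for item, votes in raw.items()}


def accuracy(pred, gold):
    """Fraction of gold-standard items whose aggregated label matches."""
    matched = sum(1 for item in gold if pred.get(item) == gold[item])
    return matched / len(gold)


# Hypothetical toy data: four annotator labels per object (assumed layout).
raw = {
    "obj_1": ["A", "A", "B", "A"],
    "obj_2": ["B", "B", "B", "A"],
    "obj_3": ["A", "B", "A", "B"],  # 2-2 tie, resolved by first occurrence
}
# Hypothetical gold-standard labels for the same objects.
gold = {"obj_1": "A", "obj_2": "B", "obj_3": "B"}

pred = aggregate(raw)
score = accuracy(pred, gold)
```

With four annotators, 2-2 ties are possible, so any voting scheme needs an explicit tie-breaking rule. The first-occurrence rule here is an arbitrary choice, and the tied item is exactly the kind of case where an aggregated label can diverge from the gold standard.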