Model-based Cleaning of the QUILT-1M Pathology Dataset for Text-Conditional Image Synthesis
This work addresses data quality issues for researchers using the QUILT-1M dataset for text-conditional image synthesis. Its contribution is incremental, since it focuses on cleaning an existing dataset.
The researchers tackled the heterogeneous image quality of the QUILT-1M pathology dataset by developing an automatic pipeline that predicts and filters common impurities, such as visible narrators and text within the image. Filtering the dataset in this way substantially improved image fidelity in text-to-image tasks.
The QUILT-1M dataset is the first openly available dataset containing images harvested from various online sources. While it provides a wide variety of data, its image quality and composition are highly heterogeneous, which limits its utility for text-conditional image synthesis. We propose an automatic pipeline that predicts the most common impurities within the images, e.g., visibility of narrators, desktop environment and pathology software, or text within the image. Additionally, we propose semantic alignment filtering of the image-text pairs. Our findings demonstrate that rigorously filtering the dataset substantially enhances image fidelity in text-to-image tasks.
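The two filtering stages described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Sample` record, the impurity names, both thresholds, and the use of cosine similarity between image and caption embeddings as the alignment score are all assumptions made for illustration. The sketch also assumes that impurity probabilities and embeddings have already been computed by some upstream models.

```python
import math
from dataclasses import dataclass


@dataclass
class Sample:
    """Hypothetical record for one image-text pair (not from the paper)."""
    image_id: str
    impurity_probs: dict  # assumed: impurity name -> predicted probability
    image_emb: list       # assumed: precomputed image embedding
    text_emb: list        # assumed: precomputed caption embedding


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_clean(sample, impurity_threshold=0.5, alignment_threshold=0.3):
    """Keep a sample only if no impurity is predicted and the caption matches.

    Both thresholds are illustrative placeholders, not values from the paper.
    """
    # Stage 1: drop the sample if any predicted impurity reaches the threshold.
    if any(p >= impurity_threshold for p in sample.impurity_probs.values()):
        return False
    # Stage 2: semantic alignment filtering of the image-text pair.
    score = cosine_similarity(sample.image_emb, sample.text_emb)
    return score >= alignment_threshold


def filter_dataset(samples, **thresholds):
    """Return the samples that pass both filtering stages."""
    return [s for s in samples if is_clean(s, **thresholds)]


# Toy data: only "a" is free of impurities and well aligned with its caption.
samples = [
    Sample("a", {"narrator": 0.05, "text": 0.10}, [1.0, 0.0], [0.9, 0.1]),
    Sample("b", {"narrator": 0.90, "text": 0.10}, [1.0, 0.0], [0.9, 0.1]),
    Sample("c", {"narrator": 0.05, "text": 0.10}, [1.0, 0.0], [0.0, 1.0]),
]
kept_ids = [s.image_id for s in filter_dataset(samples)]
```

In this toy run, sample "b" is rejected by the impurity stage and sample "c" by the alignment stage. Running the impurity check first means that alignment scores are only computed for images that have already passed it.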