No Annotations for Object Detection in Art through Stable Diffusion
This work addresses the challenge of annotating artistic images for the digital humanities by reducing the need for specialized domain expertise, though its contribution is incremental: it applies existing models to a new domain.
The paper tackles object detection in art without full bounding box supervision by leveraging diffusion models, achieving state-of-the-art performance in weakly-supervised detection and pioneering zero-shot detection on artwork datasets like ArtDL 2.0 and IconArt.
Object detection in art is a valuable tool for the digital humanities, as it identifies objects in artistic and historical images faster than human annotators can. However, annotating such images poses significant challenges because it requires specialized domain expertise. We present NADA (No Annotations for Detection in Art), a pipeline that leverages diffusion models' art-related knowledge for object detection in paintings without the need for full bounding-box supervision. Our method supports both weakly-supervised and zero-shot scenarios and does not require any fine-tuning of its pretrained components. It consists of a class proposer based on large vision-language models and a class-conditioned detector based on Stable Diffusion. NADA is evaluated on two artwork datasets, ArtDL 2.0 and IconArt, where it outperforms prior work in weakly-supervised detection and is the first work on zero-shot object detection in art. Code is available at https://github.com/patrick-john-ramos/nada
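To make the two-stage structure described above concrete, the following is a minimal sketch of how a class proposer and a class-conditioned detector could be composed. It is not taken from the NADA repository: the names `Detection`, `run_pipeline`, `toy_proposer`, and `toy_detector` are hypothetical, and the two toy stages are stand-ins for the vision-language model and the Stable Diffusion-based detector, whose internals the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical types for illustration only.
Image = Sequence[Sequence[int]]          # toy image: rows of pixel values
Box = Tuple[int, int, int, int]          # (x_min, y_min, x_max, y_max)


@dataclass(frozen=True)
class Detection:
    label: str
    box: Box


def run_pipeline(
    image: Image,
    propose_classes: Callable[[Image], List[str]],
    detect_class: Callable[[Image, str], List[Box]],
) -> List[Detection]:
    """Stage 1 proposes class labels; stage 2 localizes each proposed class."""
    detections: List[Detection] = []
    for label in propose_classes(image):
        for box in detect_class(image, label):
            detections.append(Detection(label, box))
    return detections


def toy_proposer(image: Image) -> List[str]:
    # Stand-in for the class proposer (assumption: it returns label strings).
    return ["class_a"]


def toy_detector(image: Image, label: str) -> List[Box]:
    # Stand-in for the class-conditioned detector: returns one box
    # covering the whole image, regardless of the label.
    height, width = len(image), len(image[0])
    return [(0, 0, width, height)]


toy_image = [[0] * 4 for _ in range(3)]  # 3 rows by 4 columns
result = run_pipeline(toy_image, toy_proposer, toy_detector)
```

Passing the two stages as plain callables reflects the abstract's point that the pretrained components are used without fine-tuning: under this assumption, switching between the weakly-supervised and zero-shot scenarios would only change which proposer is supplied.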