Parrot Captions Teach CLIP to Spot Text
These findings highlight a critical flaw in widely used vision-language models and datasets, and call for a redesign of either CLIP-like models or dataset curation pipelines to improve robustness in AI applications.
The paper finds that CLIP models exhibit a text spotting bias: they focus on text embedded in images rather than on visual semantics. It attributes this bias to parrot captions in the LAION-2B dataset, where around 50% of images contain visual text and around 30% of caption words match that text, and it shows that training with such captions harms visual-language representation learning.
Despite being the foundation model in numerous vision-language applications, CLIP suffers from a severe text spotting bias. This bias causes CLIP models to "parrot" the visual text embedded within images while disregarding the authentic visual semantics. We uncover that in the most popular image-text dataset, LAION-2B, the captions also densely parrot (spell) the text embedded in images. Our analysis shows that around 50% of images contain embedded visual text, and around 30% of caption words appear in that embedded text. Based on this observation, we thoroughly inspect the different released versions of CLIP models and verify that visual text is the dominant factor in measuring LAION-style image-text similarity for these models. To examine whether these parrot captions shape the text spotting bias, we train a series of CLIP models on LAION subsets curated by different parrot-caption-oriented criteria. We show that training with parrot captions easily shapes such bias but harms the expected visual-language representation learning in CLIP models. This suggests that it is urgent to revisit either the design of CLIP-like models or the existing image-text dataset curation pipeline built on CLIP score filtering.
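The word-level statistic mentioned above (the share of caption words that also appear as text embedded in the image) can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes the embedded text has already been extracted by some text spotter and is available as a plain string, it uses a simple lowercase alphanumeric tokenization, and the names `tokenize`, `co_occurrence_rate`, and `filter_low_parrot` are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens (an assumed tokenization)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def co_occurrence_rate(caption, spotted_text):
    """Return the fraction of caption words that also occur in the spotted text.

    `spotted_text` is assumed to be the output of a text spotter run on the image.
    """
    caption_words = tokenize(caption)
    if not caption_words:
        return 0.0
    spotted_words = set(tokenize(spotted_text))
    matched = sum(1 for word in caption_words if word in spotted_words)
    return matched / len(caption_words)


def filter_low_parrot(samples, max_rate=0.0):
    """Keep (caption, spotted_text) pairs whose co-occurrence rate is at most `max_rate`.

    This is one illustrative curation criterion, not the criteria used in the paper.
    """
    return [
        (caption, spotted)
        for caption, spotted in samples
        if co_occurrence_rate(caption, spotted) <= max_rate
    ]


if __name__ == "__main__":
    samples = [
        ("Big Sale on red shoes", "BIG SALE 50% OFF"),  # caption partly spells the image text
        ("a dog running on the beach", ""),             # no embedded text was spotted
    ]
    for caption, spotted in samples:
        print(caption, "->", co_occurrence_rate(caption, spotted))
    print(filter_low_parrot(samples))
```

In this sketch, a caption whose rate is above zero would count as a parrot caption to some degree, and raising or lowering `max_rate` yields subsets with more or less parroting, which is the kind of curated subset the training experiments rely on.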