CV AI · Jan 9, 2024

Let's Go Shopping (LGS) -- Web-Scale Image-Text Dataset for Visual Concept Understanding

Yatong Bai, Utsav Garg, Apaar Shanker, Haoming Zhang, Samyak Parajuli, Erhan Bas, Isidora Filipovic, Amelia N. Chu, Eugenia D Fomitcheva, Elliot Branson, Aerin Kim, Somayeh Sojoudi

arXiv:2401.04575v25.22 citations · h-index: 30

Originality Synthesis-oriented

AI Analysis

LGS provides a cleaner, more informative dataset for researchers and practitioners in e-commerce-focused computer vision. The contribution is incremental, however: it applies existing data collection methods to a specific domain.

The authors address the scarcity of large-scale annotated datasets for vision and vision-language tasks by introducing the Let's Go Shopping (LGS) dataset, a collection of 15 million image-caption pairs from e-commerce websites. Their experiments show that classifiers trained on existing benchmark datasets do not readily generalize to e-commerce data, while specific self-supervised visual feature extractors generalize better. LGS also enables image-captioning models to generate richer captions and helps text-to-image generation models achieve e-commerce style transfer.

Vision and vision-language applications of neural networks, such as image classification and captioning, rely on large-scale annotated datasets that require non-trivial data-collecting processes. This time-consuming endeavor hinders the emergence of large-scale datasets, limiting researchers and practitioners to a small number of choices. Therefore, we seek more efficient ways to collect and annotate images. Previous initiatives have gathered captions from HTML alt-texts and crawled social media postings, but these data sources suffer from noise, sparsity, or subjectivity. For this reason, we turn to commercial shopping websites whose data meet three criteria: cleanliness, informativeness, and fluency. We introduce the Let's Go Shopping (LGS) dataset, a large-scale public dataset with 15 million image-caption pairs from publicly available e-commerce websites. When compared with existing general-domain datasets, the LGS images focus on the foreground object and have less complex backgrounds. Our experiments on LGS show that the classifiers trained on existing benchmark datasets do not readily generalize to e-commerce data, while specific self-supervised visual feature extractors can better generalize. Furthermore, LGS's high-quality e-commerce-focused images and bimodal nature make it advantageous for vision-language bi-modal tasks: LGS enables image-captioning models to generate richer captions and helps text-to-image generation models achieve e-commerce style transfer.
