CVLGJul 10, 2024

LEMoN: Label Error Detection using Multimodal Neighbors

arXiv:2407.18941v22 citationsh-index: 13
Originality Incremental advance
AI Analysis

This addresses the issue of noisy data for vision-language model training, offering a novel filtering approach with demonstrated improvements, though it is incremental in the context of label error detection methods.

The paper tackled the problem of mislabeled image-caption pairs in noisy datasets by proposing LEMoN, a method that uses multimodal neighborhoods to detect label errors, resulting in over 3% improvement in detection and over 2 BLEU points gain in downstream captioning performance.

Large repositories of image-caption pairs are essential for the development of vision-language models. However, these datasets are often extracted from noisy data scraped from the web, and contain many mislabeled instances. In order to improve the reliability of downstream models, it is important to identify and filter images with incorrect captions. However, beyond filtering based on image-caption embedding similarity, no prior works have proposed other methods to filter noisy multimodal data, or concretely assessed the impact of noisy captioning data on downstream training. In this work, we propose, theoretically justify, and empirically validate LEMoN, a method to identify label errors in image-caption datasets. Our method leverages the multimodal neighborhood of image-caption pairs in the latent space of contrastively pretrained multimodal models to automatically identify label errors. Through empirical evaluations across eight datasets and twelve baselines, we find that LEMoN outperforms the baselines by over 3% in label error detection, and that training on datasets filtered using our method improves downstream captioning performance by more than 2 BLEU points over noisy training.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes