CV · Jul 11, 2025

VISTA: A Visual Analytics Framework to Enhance Foundation Model-Generated Data Labels

arXiv:2507.09008v1 · 7 citations · h-index: 9 · IEEE Trans Vis Comput Graph
Originality: Incremental advance
AI Analysis

This work addresses the challenge of validating FM-generated labels in open-vocabulary image segmentation, offering researchers and practitioners a tool to improve data quality. The advance is incremental because it builds on existing validation methods.

The paper tackles the problem of low-quality labels generated by foundation models for large-scale datasets, introducing VISTA, a visual analytics framework that improves data quality and enhances multi-modal model performance, as demonstrated through use cases on benchmark datasets with expert reviews.

Advances in multi-modal foundation models (FMs) (e.g., CLIP and LLaVA) have facilitated the auto-labeling of large-scale datasets, enhancing model performance in challenging downstream tasks such as open-vocabulary object detection and segmentation. However, the quality of FM-generated labels is less studied, as existing approaches prioritize data quantity over quality. This is because validating large volumes of data without ground truth presents a considerable challenge in practice. Existing methods typically rely on limited metrics to identify problematic data, lacking a comprehensive perspective, or apply human validation to only a small fraction of the data, failing to address the full spectrum of potential issues. To overcome these challenges, we introduce VISTA, a visual analytics framework that improves data quality to enhance the performance of multi-modal models. Targeting the complex and demanding domain of open-vocabulary image segmentation, VISTA integrates multi-phased data validation strategies with human expertise, enabling humans to identify, understand, and correct hidden issues within FM-generated labels. Through detailed use cases on two benchmark datasets and expert reviews, we demonstrate VISTA's effectiveness from both quantitative and qualitative perspectives.
