CV · AI · Oct 30, 2025

ConceptScope: Characterizing Dataset Bias via Disentangled Visual Concepts

arXiv:2510.26186v1 · 3 citations · h-index: 10
Originality: Incremental advance
AI Analysis

The work targets dataset auditing and model diagnostics, giving researchers and practitioners a scalable way to identify biases without costly annotations. It is incremental, however, because it builds on existing concept-discovery methods.

The paper tackles dataset bias in machine learning by introducing ConceptScope, a framework that automatically discovers and quantifies human-interpretable visual concepts using Sparse Autoencoders trained on representations from vision foundation models. This enables bias identification and robustness evaluation. Validation shows that the framework captures diverse concepts and detects both known and previously unannotated biases.

Dataset bias, where data points are skewed toward certain concepts, is ubiquitous in machine learning datasets. Yet, systematically identifying these biases is challenging without costly, fine-grained attribute annotations. We present ConceptScope, a scalable and automated framework for analyzing visual datasets by discovering and quantifying human-interpretable concepts using Sparse Autoencoders trained on representations from vision foundation models. ConceptScope categorizes concepts into target, context, and bias types based on their semantic relevance and statistical correlation to class labels, enabling class-level dataset characterization, bias identification, and robustness evaluation through concept-based subgrouping. We validate that ConceptScope captures a wide range of visual concepts, including objects, textures, backgrounds, facial attributes, emotions, and actions, through comparisons with annotated datasets. Furthermore, we show that concept activations produce spatial attributions that align with semantically meaningful image regions. ConceptScope reliably detects known biases (e.g., background bias in Waterbirds) and uncovers previously unannotated ones (e.g., co-occurring objects in ImageNet), offering a practical tool for dataset auditing and model diagnostics.
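The target/context/bias split described in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the function name `categorize_concepts`, the per-concept `relevance` scores, the thresholds, and the use of an activation-frequency gap as the correlation measure are all assumptions made for illustration. The abstract states only that the categories depend on semantic relevance and statistical correlation to class labels.

```python
from typing import Dict, List


def categorize_concepts(
    activations: List[List[float]],  # activations[i][k]: strength of concept k on image i
    labels: List[int],               # class label of each image
    target_class: int,
    relevance: List[float],          # hypothetical semantic relevance of each concept to the class, in [0, 1]
    relevance_threshold: float = 0.5,    # assumed cutoff, not from the paper
    correlation_threshold: float = 0.3,  # assumed cutoff, not from the paper
) -> Dict[int, str]:
    """Label each concept as 'target', 'bias', or 'context' for one class."""
    in_class = [row for row, y in zip(activations, labels) if y == target_class]
    out_class = [row for row, y in zip(activations, labels) if y != target_class]
    if not in_class or not out_class:
        raise ValueError("need images both inside and outside the target class")

    categories: Dict[int, str] = {}
    for k in range(len(relevance)):
        # Fraction of images on which concept k is active (activation > 0).
        freq_in = sum(row[k] > 0 for row in in_class) / len(in_class)
        freq_out = sum(row[k] > 0 for row in out_class) / len(out_class)
        gap = freq_in - freq_out  # simple stand-in for correlation with the label

        if relevance[k] >= relevance_threshold:
            categories[k] = "target"   # semantically part of the class itself
        elif gap >= correlation_threshold:
            categories[k] = "bias"     # not part of the class, yet tracks the label
        else:
            categories[k] = "context"  # not part of the class, no strong link to the label
    return categories


# Toy data: concept 0 = bird body, 1 = water background, 2 = sky (all hypothetical).
toy_activations = [
    [1.0, 0.8, 0.5],  # class 1
    [0.9, 0.7, 0.0],  # class 1
    [0.0, 0.0, 0.4],  # class 0
    [0.0, 0.0, 0.0],  # class 0
]
toy_labels = [1, 1, 0, 0]
toy_result = categorize_concepts(
    toy_activations, toy_labels, target_class=1, relevance=[0.9, 0.1, 0.1]
)
print(toy_result)
```

In this toy data the water-background concept appears on every class-1 image and on no class-0 image, so it is flagged as a bias concept, which mirrors the Waterbirds background bias mentioned in the abstract. Images could then be grouped by whether a flagged concept is active to compare model accuracy across subgroups.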
