CV CL LG Apr 8, 2021

Dataset Summarization by K Principal Concepts

arXiv:2104.03952v2 5.63 citations

Originality: Incremental advance

AI Analysis

This work addresses the need for interpretable dataset summarization among researchers and practitioners, though the advance is incremental because it builds on existing embedding methods.

The paper tackles the problem of summarizing datasets by identifying K principal concepts that explain data variation, using an image-language embedding and optimization to select concepts from a large candidate bank, and demonstrates efficacy through extensive experiments.

We propose the new task of K principal concept identification for dataset summarization. The objective is to find a set of K concepts that best explain the variation within the dataset. Concepts are high-level human interpretable terms such as "tiger", "kayaking" or "happy". The K concepts are selected from a (potentially long) input list of candidates, which we denote the concept-bank. The concept-bank may be taken from a generic dictionary or constructed by task-specific prior knowledge. An image-language embedding method (e.g. CLIP) is used to map the images and the concept-bank into a shared feature space. To select the K concepts that best explain the data, we formulate our problem as a K-uncapacitated facility location problem. An efficient optimization technique is used to scale the local search algorithm to very large concept-banks. The output of our method is a set of K principal concepts that summarize the dataset. Our approach provides a more explicit summary in comparison to selecting K representative images, which are often ambiguous. As a further application of our method, the K principal concepts can be used to classify the dataset into K groups. Extensive experiments demonstrate the efficacy of our approach.
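
The selection step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a cost of one minus cosine similarity between image and concept embeddings, uses a naive greedy initialization followed by single-swap local search, and omits the efficient optimization the abstract mentions for very large concept-banks. The function name `select_k_concepts` and the synthetic embeddings are hypothetical; in practice the embeddings would come from an image-language model such as CLIP.

```python
import numpy as np

def select_k_concepts(image_emb, concept_emb, k, max_iters=100):
    """Pick k concept indices minimizing the total image-to-nearest-concept cost.

    Assumed cost: 1 - cosine similarity (an illustrative choice).
    Returns (selected concept indices, per-image position into that list).
    """
    # L2-normalize rows so the dot product equals cosine similarity.
    x = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    c = concept_emb / np.linalg.norm(concept_emb, axis=1, keepdims=True)
    cost = 1.0 - x @ c.T  # shape: (n_images, n_concepts)
    n_concepts = cost.shape[1]

    def total(sel):
        # Each image is served by its cheapest selected concept.
        return cost[:, sel].min(axis=1).sum()

    # Greedy initialization: repeatedly add the concept that lowers the cost most.
    selected = []
    for _ in range(k):
        candidates = [j for j in range(n_concepts) if j not in selected]
        selected.append(min(candidates, key=lambda j: total(selected + [j])))

    # Local search: apply the best single swap until no swap improves the cost.
    for _ in range(max_iters):
        best_cost, best_sel = total(selected), None
        for i in range(k):
            for j in range(n_concepts):
                if j in selected:
                    continue
                trial = selected[:i] + [j] + selected[i + 1:]
                trial_cost = total(trial)
                if trial_cost < best_cost - 1e-12:
                    best_cost, best_sel = trial_cost, trial
        if best_sel is None:
            break
        selected = best_sel

    # Classify each image by its nearest selected concept.
    assignment = cost[:, selected].argmin(axis=1)
    return selected, assignment

# Synthetic stand-in for CLIP features: 5 orthogonal "concepts" in 8 dimensions,
# with 20 images scattered around concept 0 and 20 around concept 3.
rng = np.random.default_rng(0)
concepts = np.eye(5, 8)
images = np.vstack([
    concepts[0] + rng.normal(scale=0.05, size=(20, 8)),
    concepts[3] + rng.normal(scale=0.05, size=(20, 8)),
])
chosen, groups = select_k_concepts(images, concepts, k=2)
print(sorted(chosen))
```

The returned `groups` array corresponds to the classification use mentioned in the abstract: each image is assigned to whichever of the K selected concepts is closest to it in the shared embedding space.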
