LG CY · Nov 28, 2023

SoUnD Framework: Analyzing (So)cial Representation in (Un)structured (D)ata

Mark Díaz, Sunipa Dev, Emily Reif, Emily Denton, Vinodkumar Prabhakaran

arXiv:2311.17259v2 · 7.76 citations · h-index: 17

Originality: Synthesis-oriented

AI Analysis

This work addresses the problem of responsible AI decision-making for dataset use and documentation, though it appears incremental as it builds on existing data analysis methods.

The authors tackled the challenge of analyzing human representation in unstructured data used for foundation models by proposing the SoUnD framework, which they applied to C4 and LAION-400M datasets to identify downstream risks.

The unstructured nature of data used in foundation model development is a challenge to systematic analyses for making data use and documentation decisions. From a Responsible AI perspective, these decisions often rely upon understanding how people are represented in data. We propose a framework designed to guide analysis of human representation in unstructured data and identify downstream risks. We apply the framework in two toy examples using the Common Crawl web text corpus (C4) and LAION-400M. We also propose a set of hypothetical action steps in service of dataset use, development, and documentation.
