SoUnD Framework: Analyzing (So)cial Representation in (Un)structured (D)ata
This work addresses responsible AI decision-making about dataset use and documentation. The contribution appears incremental, since it builds on existing data analysis methods.
The authors address the challenge of analyzing human representation in the unstructured data used for foundation models by proposing the SoUnD framework, which they apply to the C4 and LAION-400M datasets to identify downstream risks.
The unstructured nature of data used in foundation model development poses a challenge to the systematic analysis needed for making data use and documentation decisions. From a Responsible AI perspective, these decisions often rely upon understanding how people are represented in data. We propose a framework designed to guide analysis of human representation in unstructured data and identify downstream risks. We apply the framework in two toy examples using the Colossal Clean Crawled Corpus (C4), a web text corpus derived from Common Crawl, and LAION-400M. We also propose a set of hypothetical action steps in service of dataset use, development, and documentation.