Data Checklist: On Unit-Testing Datasets with Usable Information
This work addresses the need for systematic dataset evaluation in machine learning, particularly for LLMs, by providing a tool for detecting dataset artifacts before they cause problems downstream. Its contribution is incremental in that it extends the existing idea of model checklists from models to data.
The paper tackles the problem of ad hoc dataset evaluation by proposing a principled approach to unit-testing datasets: a taxonomy of tests based on V-information, collected into what the authors call a data checklist. Applied to well-known datasets such as SNLI, the checklist recovers known artifacts; applied to preference datasets for LLM alignment, it uncovers previously unknown ones. Filtering data with the checklist also improves the efficacy and data efficiency of preference alignment.
Model checklists (Ribeiro et al., 2020) have emerged as a useful tool for understanding the behavior of LLMs, analogous to unit-testing in software engineering. However, despite datasets being a key determinant of model behavior, evaluating datasets, e.g., for the existence of annotation artifacts, is largely done ad hoc, once a problem in model behavior has already been found downstream. In this work, we take a more principled approach to unit-testing datasets by proposing a taxonomy based on the V-information literature. We call a collection of such unit tests a data checklist. Using a checklist, not only are we able to recover known artifacts in well-known datasets such as SNLI, but we also discover previously unknown artifacts in preference datasets for LLM alignment. Data checklists further enable a new kind of data filtering, which we use to improve the efficacy and data efficiency of preference alignment.
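To make the idea of a V-information-based unit test concrete, here is a minimal sketch in Python. It is not the authors' implementation. It assumes the usual estimate from the V-information literature: the pointwise V-information (PVI) of an example is the log-probability (in bits) that a model given the input assigns to the gold label, minus the log-probability assigned by a model given a null input, and V-information is the average PVI over held-out examples. The function names, the threshold values, and the probabilities in the usage note are all hypothetical and chosen only for illustration.

```python
import math
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def pvi(p_with_input: float, p_null: float) -> float:
    """Pointwise V-information in bits for one example.

    p_with_input: probability of the gold label from a model that sees the input.
    p_null: probability of the gold label from a model that sees a null input.
    """
    return math.log2(p_with_input) - math.log2(p_null)


def v_information(p_with_input: Sequence[float], p_null: Sequence[float]) -> float:
    """Estimate V-information as the mean PVI over held-out examples."""
    if len(p_with_input) != len(p_null) or not p_with_input:
        raise ValueError("need two non-empty sequences of equal length")
    scores = [pvi(p, q) for p, q in zip(p_with_input, p_null)]
    return sum(scores) / len(scores)


def passes_low_information_test(
    p_with_input: Sequence[float], p_null: Sequence[float], epsilon: float
) -> bool:
    """Hypothetical unit test: an attribute that should NOT predict the label
    (e.g. the hypothesis alone in an NLI dataset) must carry at most
    `epsilon` bits of usable information about it."""
    return v_information(p_with_input, p_null) <= epsilon


def filter_by_pvi(
    examples: Sequence[T],
    p_with_input: Sequence[float],
    p_null: Sequence[float],
    max_pvi: float,
) -> List[T]:
    """Hypothetical filter: keep only examples whose PVI under the
    artifact-only model is at most `max_pvi` bits."""
    return [
        ex
        for ex, p, q in zip(examples, p_with_input, p_null)
        if pvi(p, q) <= max_pvi
    ]
```

As a usage example under made-up numbers: if a model that sees only the attribute under test assigns the gold labels probabilities [0.5, 0.5, 0.25, 1.0] while the null-input model assigns 0.25 to each, the per-example PVI values are 1, 1, 0, and 2 bits, so the estimated V-information is 1 bit. A test with epsilon = 0.1 would fail, flagging a possible artifact, and filter_by_pvi with max_pvi = 1.5 would drop the fourth example. In practice the two probability lists would come from models fine-tuned with and without the input, evaluated on held-out data.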