Understanding the Dataset Practitioners Behind Large Language Model Development
This work addresses inconsistent data quality practices among dataset practitioners in AI development, but its contribution is incremental: it primarily describes current challenges without proposing new solutions.
The study investigated the role of dataset practitioners in large language model development at Google, finding that while data quality is a priority, there is little consensus on its definition and evaluation methods, leading practitioners to rely on intuition or custom code.
As large language models (LLMs) become more advanced and impactful, it is increasingly important to scrutinize the data that they rely upon and produce. What does it mean to be a dataset practitioner doing this work? We approach this question in two parts: first, we define the role of "dataset practitioners" by performing a retrospective analysis of the responsibilities of teams contributing to LLM development at a technology company, Google. Then, we conduct semi-structured interviews with a cross-section of these practitioners (N=10). We find that although data quality is a top priority, there is little consensus around what data quality is and how to evaluate it. Consequently, practitioners either rely on their own intuition or write custom code to evaluate their data. We discuss potential reasons for this phenomenon and opportunities for alignment.